wizacode / php-etl
This project forked from leomarquine/php-etl
Extract, Transform and Load data using PHP.
License: MIT License
Does anyone have a complete example, no matter how simple, that includes database connections? I am unable to figure out how to get my database configuration into the Table class via the Manager and ConnectionFactory so that my program can read from one database and write to another.
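A minimal wiring sketch, assuming the fork's Manager / ConnectionFactory API; the connection names 'source' and 'target', the credentials, and the table names are hypothetical:

```php
use Wizaplace\Etl\Database\ConnectionFactory;
use Wizaplace\Etl\Database\Manager;
use Wizaplace\Etl\Etl;
use Wizaplace\Etl\Extractors\Table;
use Wizaplace\Etl\Loaders\Insert;

// Register one named connection per database on the manager.
$manager = new Manager(new ConnectionFactory());
$manager->addConnection([
    'driver' => 'mysql',
    'host' => 'localhost',
    'database' => 'source_db',
    'username' => 'reader',
    'password' => 'secret',
], 'source');
$manager->addConnection([
    'driver' => 'mysql',
    'host' => 'localhost',
    'database' => 'target_db',
    'username' => 'writer',
    'password' => 'secret',
], 'target');

// Extractor and loader both receive the manager and select a connection by name.
(new Etl())
    ->extract(new Table($manager), 'users', ['connection' => 'source'])
    ->load(new Insert($manager), 'users', ['connection' => 'target'])
    ->run();
```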
See here
Right now, only the Loader takes notice of discarded rows.
Has anyone encountered this problem: when there is no match between two sources, the resulting CSV file has its columns messed up?
PS: the data is extracted and aggregated from two queries, not CSV files. Here are the two data sources:
1.
"id";"cId"
3260311;5
3260312;4
3260313;3
2.
"id";"email"
3260312;"[email protected]";
3260315;"[email protected]";
Thanks.
See this PR in the original repo: leomarquine#56
I wonder if we could implement it too.
@Nicolas-Masson-Wizaplace @kdebisschop If you have any opinion on that.
I've been working on an Extractor class for Google Analytics. It is all set, to the extent that it works fine. I have not been able to think of a way to unit test it meaningfully, but that is all the work that remains.
Initially I had thought that it made sense to contribute here as a new Extractor. But on reflection, I'm not so sure.
It would add a limited-use dependency on google/apiclient that all users would incur, even though relatively few are likely to use it.
I am thinking a separate repo for limited use plugins makes more sense. I wonder what other people think.
And change the link in the MIT license badge on the README.md file.
With the following input sources:
[
{
"id": "1",
"type": "premium",
"subscriptionNumber": "p123"
},
{
"id": "2",
"type": "standard"
},
{
"id": "3",
"type": "gold",
"subscriptionNumber": "g234"
}
]
1;John;Doe;[email protected]
2;Jane;Does;[email protected]
3;Bart;Simpson;[email protected]
I would like to extract data into a CSV with the following columns:
The result is the following:
1;[email protected];premium;p123
3;[email protected];gold;g234
The code works as expected, but it throws an exception because there is one invalid row. Throwing is currently the only way to discard invalid rows.
Could we consider adding a flag (or something similar) to keep discarding invalid rows without throwing an exception? In some use cases it can be perfectly normal to have invalid rows.
For headerless CSV data, e.g.:
1;John;Doe;[email protected]
2;Jane;Does;[email protected]
3;Bart;Simpson;[email protected]
Add a headers option to the Csv extractor class:
(new Etl())
    ->extract(
        new Csv(),
        '/path/to/csv.file',
        [
            'delimiter' => ';',
            'headers' => ['id', 'name', 'surname', 'email'],
        ]
    )
Hello everyone,
I just found out about your PHP ETL library and I was wondering if it is possible to use RenameColumns() from the transform step to rename a column in a JSON source that has more than one level of depth.
For example, given this JSON file:
[
{
"id": "110",
"person": [
{
"idPerson": "89",
"addresses": {
"home": [
{
"name": "10 rue des châtaigniers",
"country": "France"
}
],
"work": [
{
"name": "12 rue des bois",
"country": "Belgium"
}
]
}
}
]
}
]
I want to load only the id, the idPerson, and just the country of the "home" address, corresponding to my database:
$etl->extract($extractor, 'fakeDataSource/personAddress.json')
->transform($transform, ['columns' => ???]) // What do I need to put here?
->load($loader, 'schema.mytable', ['key' => ['id'], 'columns' => ['id' => 'id', 'id_personne' => 'idPerson', 'pays' => 'country']])
->run();
Thanks in advance for your response, and I'm sorry if this is the wrong place to ask this kind of question; don't hesitate to tell me.
And thank you for this library, it could really save me time at work. Good job! 😃
Imagine the following use case: we need data coming from two different sources, for instance two CSV files.
Both have a field in common, e.g. email.
Open them in Excel to sort them by email.
Edit the ETL to be able to chain extractors, get the data from the two sources, and merge them into one row. Since the two CSVs are sorted by email, we assume that each line in one file matches the corresponding line in the other file. But we still throw an exception if the emails don't match.
Note that we must be able to specify the field the two files have in common.
Question: should we be able to merge n files instead of only two?
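The chaining described above can be sketched with plain arrays standing in for the library's rows; mergeSortedSources() and the sample data are hypothetical:

```php
/**
 * Merge two iterables of rows that are pre-sorted by a shared key,
 * throwing when the key values do not line up.
 */
function mergeSortedSources(iterable $left, iterable $right, string $key): \Generator
{
    // Materialize the right-hand source so we can walk it in step with the left.
    $rightRows = is_array($right) ? array_values($right) : iterator_to_array($right, false);

    $index = 0;
    foreach ($left as $leftRow) {
        $rightRow = $rightRows[$index++] ?? null;
        if (null === $rightRow || $leftRow[$key] !== $rightRow[$key]) {
            throw new \RuntimeException("No match on '{$key}' for value '{$leftRow[$key]}'.");
        }
        // Both sources agree on the key: merge into a single row.
        yield array_merge($leftRow, $rightRow);
    }
}

$users = [
    ['email' => 'johndoe@example.com', 'name' => 'John'],
    ['email' => 'jane@example.com', 'name' => 'Jane'],
];
$subscriptions = [
    ['email' => 'johndoe@example.com', 'type' => 'premium'],
    ['email' => 'jane@example.com', 'type' => 'standard'],
];

$merged = iterator_to_array(mergeSortedSources($users, $subscriptions, 'email'));
```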
When extracting data from a JSON file and loading it into a PostgreSQL database, array values are not handled correctly.
For example, when trying to load the value ["value1", "value2"] from a JSON file into a varchar[] field in a PostgreSQL database, this error occurs:
Array to string conversion
The JSON array should be transformed into a PostgreSQL array format string so it can be properly inserted into the database: "{value1, value2}".
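A conversion helper along those lines could look like this (toPostgresArray() is a hypothetical name; the quoted element form used here is also valid PostgreSQL array input and is safer for values containing commas or quotes):

```php
/**
 * Convert a PHP array (e.g. decoded from JSON) into a PostgreSQL
 * array literal such as {"value1","value2"}.
 */
function toPostgresArray(array $values): string
{
    $quoted = array_map(
        // Escape embedded backslashes and double quotes, then quote each element.
        fn ($value) => '"' . addcslashes((string) $value, "\\\"") . '"',
        $values
    );

    return '{' . implode(',', $quoted) . '}';
}

$literal = toPostgresArray(['value1', 'value2']);
```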
Example of implementation (does not require doc update):
/**
 * Extract data from the input.
 */
public function extract(): \Generator
{
    $handle = fopen($this->input, 'r');
    $columns = $this->makeColumns($handle);
    $rowIndex = 1;

    while ($row = fgets($handle)) {
        $rowIndex++;
        yield new Row($this->makeRow($row, $columns, $rowIndex));
    }

    fclose($handle);
}

/**
 * Converts the row string to an array.
 */
protected function makeRow(string $row, array $columns, int $rowIndex): array
{
    $row = str_getcsv($row, $this->delimiter, $this->enclosure);
    $data = [];
    $rowCount = count($row);
    $columnsCount = count($columns);

    if ($rowCount !== $columnsCount) {
        throw new \LogicException("Row with index #{$rowIndex} only contains {$rowCount} elements while {$columnsCount} were expected.");
    }

    foreach ($columns as $column => $index) {
        $data[$column] = $row[$index - 1];
    }

    return $data;
}
In some use cases, we might need many instances of the same component (ETL, CSV extractor...).
So far there are two ways to proceed:
We might consider creating a factory. It would be a class with one public method, get(), taking two parameters:
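A minimal sketch of such a factory; the class name and the exact parameters are hypothetical:

```php
/**
 * Returns a fresh instance of the requested class on every call,
 * forwarding constructor arguments.
 */
class ComponentFactory
{
    /** @param mixed[] $arguments */
    public function get(string $class, array $arguments = []): object
    {
        return new $class(...$arguments);
    }
}

// Two calls yield two distinct instances of the same component.
$factory = new ComponentFactory();
$a = $factory->get(\ArrayObject::class, [[1, 2]]);
$b = $factory->get(\ArrayObject::class, [[1, 2]]);
```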
In another issue, a question was raised about a function's PHPdoc - in particular, the function had been like this
/**
* @param string $a
* @param array $b
*/
public function foo (string $a, $b) {...}
I removed "@param string $a" in order to pass the inspection "no_superfluous_phpdoc_tags" (which is implied by @symfony in the distributed .php_cs.dist file). That raised a question among the reviewers and the statement that one reviewer would like to see PHPdoc type hints for all parameters.
I'm creating this issue as a locus for any follow-up conversation.
If we want to see PHPdoc type hints for all parameters, we would have to follow one of these courses:
I can live with any choice that makes the code and tests comply with the inspection profiles we ship alongside. But I prefer the configuration we have. In my mind, the desired ideal is that nearly all variables are self-documenting and are constrained by php-native types. With the current codebase, we can verify that things like
ETL::load(Loader $loader, string $output, $options = []): Etl {}
can be refactored to
ETL::load(Loader $loader, string $output, array $options = []): Etl
It seems implied by the code, but since that change could break backward compatibility, it also seems worth verifying explicitly.
There are about a dozen places in the code where the variable type is mixed. With the current inspection profiles, those variables would need to retain their PHPdoc. In all other cases, PHPdoc would be provided only when the variable name was not felt to be adequately self-documenting.
I came across a bug where the CSV extractor assumes the header is a single line (it uses fgets when parsing the header).
Using the following CSV file:
Column 1,Column 2,"Multi-Line
Column 3"
Value 1,Value 2,Value 3
Running the following code:
(new Etl())
->extract(new Csv(), __DIR__.'/test.csv')
->load($loader, '', ['index' => 'Column 1'])
->run();
I get the following notice when the extractor tries to parse the first row (which is actually the second line of the header):
Notice: Undefined offset: 1
Thanks for the hard work on this library. I'm happy to make a pull request if you're open to that.
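A minimal sketch of one possible fix, assuming the header parsing switches from fgets() to fgetcsv(), which honors enclosures across line breaks:

```php
// Same file contents as in the report: the third header cell spans two lines.
$csv = "Column 1,Column 2,\"Multi-Line\nColumn 3\"\nValue 1,Value 2,Value 3\n";

$handle = fopen('php://memory', 'r+');
fwrite($handle, $csv);
rewind($handle);

// One fgetcsv() call consumes the full, quoted header, newline included.
$headers = fgetcsv($handle);
// The next call reads the first data row, correctly aligned with the header.
$row = array_combine($headers, fgetcsv($handle));
fclose($handle);
```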
I have this code. Working on tests and documentation.
The idea here is to use a Generator callback to yield multiple rows for each row of an input.
In particular, can use $pipeline->extract(...)->toIterator() as input to this Extractor. In such a configuration, two extractors can be chained together to return N x M rows selecting N rows of JSON arrays from an SQL database, each containing M elements in the JSON array.
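The idea can be sketched with plain arrays standing in for the library's Row objects; expandRows() and the sample data are hypothetical:

```php
/**
 * Feed every input row to a callback that yields zero or more output rows.
 */
function expandRows(iterable $rows, callable $callback): \Generator
{
    foreach ($rows as $row) {
        yield from $callback($row);
    }
}

// Each input row carries a JSON-encoded array; emit one row per element,
// so N input rows become N x M output rows.
$input = [
    ['id' => 1, 'tags' => '["a","b"]'],
    ['id' => 2, 'tags' => '["c"]'],
];

$output = iterator_to_array(expandRows($input, function (array $row): \Generator {
    foreach (json_decode($row['tags'], true) as $tag) {
        yield ['id' => $row['id'], 'tag' => $tag];
    }
}), false); // preserve_keys=false: inner generators restart their keys
```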
I have this code and tests; working on docs now.
This is for the 2.x branch/release. PR will be added shortly.
The work will replace PHPdoc type annotations with PHP 7.4 typed properties. This is a compatibility-breaking change.
Though we don't have a published code style guide, previous conversations have suggested our desire is to avoid duplication and excess code that does not carry information. So I've removed or streamlined parameter comments where appropriate (i.e., most parameters need no doc, or only a single line).
Also, in a few places, methods can use @inheritdoc; I do so when the existing comment just parrots the parent's doc. (Note: all this plays nicely with phpDocumentor. Using @inheritdoc for inherited properties does not seem to, so I have left comments there even if they are largely redundant.)
Also, since our php_cs_fixer profile uses symfony, and since most of our variables are in camelCase, I refactored the handful of variables in snake_case so the project would be consistent overall.
In the absence of a formal style guide, I am definitely making assumptions about what our desired style looks like. As a result, I expect and welcome comment.
When 'columns' => [] in the Insert loader, all columns are selected.
When 'columns' => [] in the InsertUpdate loader, no columns are selected.
I think InsertUpdate is wrong and
if (is_array($this->columns) ?? [] !== $this->columns) {
should be
if (is_array($this->columns) && [] !== $this->columns) {
This bug will more or less automatically get resolved when we start using PHP 7.4 type hint for properties.
But I have a simple patch for the current code if we want to apply it...
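The precedence issue can be demonstrated in isolation: ?? binds more loosely than !==, and is_array() never returns null, so the right-hand operand of ?? is dead code and the whole condition collapses to is_array($columns):

```php
$columns = [];

// Parses as: is_array($columns) ?? ([] !== $columns).
// is_array() never returns null, so the right-hand side never runs.
$buggy = is_array($columns) ?? [] !== $columns;

// The intended condition: an array that is also non-empty.
$fixed = is_array($columns) && [] !== $columns;
```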
After refactoring for PHP 7.4, the current function shouldBeginTransaction() in \Wizaplace\Etl\DatabaseTransaction is:
return !$this->open && (0 === $this->size || 1 === $this->count);
It's not clear to me that 0 === $this->size
should be part of that condition and in fact
return !$this->open && 1 === $this->count;
also passes all our current tests. Either the tests are not diagnostic or the code is inaccurate. We should determine which and resolve the inconsistency.
Like RenameColumns, but does not delete the original.
I have this code. Working on tests and documentation.
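On plain array rows, the transform could be sketched like this (copyColumns() is a hypothetical stand-in for the eventual transformer):

```php
/**
 * Like RenameColumns, but keeps the original key alongside the new one.
 *
 * @param array $columns map of existing key => additional key
 */
function copyColumns(array $row, array $columns): array
{
    foreach ($columns as $from => $to) {
        if (array_key_exists($from, $row)) {
            $row[$to] = $row[$from]; // copy, don't unset the original
        }
    }

    return $row;
}

$row = copyColumns(['id' => 1, 'name' => 'John'], ['name' => 'surname']);
```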
Improve the error message when there are invalid rows:
$incompletes row(s) were rejected because they were incomplete
instead of
$incompletes rows
On top of that, there is no unit test covering this use case.
I am using XML that has more than one level of depth. For example:
<users>
    <user>
        <profile>
        </profile>
        <addresses>
            <address></address>
        </addresses>
    </user>
</users>
I need to get the profile details and address details for each user. How can I extract them? Could anyone suggest a solution?
Thank you for forking this and continuing the work!
Wondering if it would be possible to have an option for InsertUpdate to only do an update? That is, if the item isn't in the table then do not insert?
We've got one ETL that populates the table from an API service. But then another API service reports which items were deleted (but only by index). If for some reason an item was deleted before it was extracted, it wouldn't exist in our table. In this case we don't want to try to insert it, just skip it. Is this possible?
It is not a normal ETL construct by any means, but what do you think of the idea of creating a transformer that periodically outputs some sort of progress indicator (and does nothing else)?
[Feeling a bit annoyed with myself for not having a better way than counting rows in a database to measure progress :-( ]
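As a sketch, such a transformer only needs to count rows as they pass through unchanged and fire a callback every N rows (withProgress() is a hypothetical name):

```php
/**
 * Pass-through progress reporter: forwards every row untouched and
 * invokes $onProgress with the running count every $interval rows.
 */
function withProgress(iterable $rows, int $interval, callable $onProgress): \Generator
{
    $count = 0;
    foreach ($rows as $row) {
        if (0 === ++$count % $interval) {
            $onProgress($count);
        }
        yield $row; // the data itself is never modified
    }
}

$ticks = [];
$rows = iterator_to_array(withProgress(range(1, 10), 3, function (int $count) use (&$ticks): void {
    $ticks[] = $count; // e.g. log "processed {$count} rows"
}));
```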
With a PostgreSQL database, updates are not handled correctly in some cases.
Working:
Not working:
After some searching, I found that there is an issue in the update(array $row, array $current) function of the src/Loaders/InsertUpdate.php file.
This section uses a loose comparison between the old values array and the new values array:
if ($row == array_intersect_key($current, $row)) { return; }
The $current array gets booleans back from the database as PHP booleans, while the $row array takes a string ('true', 't', 'false', 'f') or a NULL value to define booleans.
This prevents the database update when it should happen.
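The loose comparison can be demonstrated in isolation: with ==, a string is compared to a bool by casting the string to bool, so any non-empty string (even 'f' or 'false') equals true, and the loader sees no difference and skips the update:

```php
$current = ['active' => true]; // bool coming back from PostgreSQL
$row     = ['active' => 'f'];  // string destined for the database

// Loose comparison: (bool) 'f' is true, so no difference is detected.
$loose = $row == array_intersect_key($current, $row);

// Strict comparison: string vs bool differ, so the update would proceed.
$strict = $row === array_intersect_key($current, $row);
```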
Hi. Have you planned to implement a validation step? This could be very useful in ensuring that the extracted rows are valid. Thank you.
The CsvLoader appends a fileCounter and the .csv file extension in every case (whether the linePerFile option is used or not).
We force the class consumers to feed in an incomplete file path.
Ex.: the input file name /path/to/my/file produces:
/path/to/my/file_0.csv
/path/to/my/file_1.csv
/path/to/my/file_2.csv
In order to give CsvLoader consumers full control of the file path, I suggest omitting the _0 suffix in the file name generation and using pathinfo to extract the file name extension.
Ex.: the input file name /path/to/my/file.csv will produce:
/path/to/my/file.csv
/path/to/my/file_1.csv
/path/to/my/file_2.csv
And/or use the linePerFile option to toggle this first _0 file name suffix.
Maybe related to #2
We added some todo annotations where the extract() PHP function is used. It creates variables from array keys, which implies checking for their existence. This needs to be refactored.
Create two CSV splitter classes.
One splits the CSV into n parts.
One splits the CSV into files of at most n lines.
Be careful with multiline CSV entries.
Needs a documentation update.
So far we only have SQL loaders.
We need a basic CSV loader, enabling ETL to CSV output.
Needs a documentation update.
Using this library, can we extract data from an MSSQL database and load it into a CSV file?
We call for PSR-12 in phpcs, but the tests do not meet that standard, primarily due to snake_case test names.
I'm happy to create a patch to address this. It should not create any backward compatibility issues and could land as a 1.2.x release or as 2.x if we decide to create a new major release.
In the CSV extractor, if the file is not found, PHP issues a warning.
We should suppress the warning (with the '@' operator) and throw a real exception instead, which is easier for the calling code to handle.
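A minimal sketch of the suggested behavior (openFile() is a hypothetical helper):

```php
/**
 * Open a file for reading, converting PHP's warning into an exception.
 *
 * @return resource
 */
function openFile(string $path)
{
    $handle = @fopen($path, 'r'); // '@' silences the PHP warning
    if (false === $handle) {
        throw new \RuntimeException("Unable to open file '{$path}'.");
    }

    return $handle;
}

// The calling code can now handle the failure with a plain try/catch.
try {
    openFile('/no/such/file.csv');
    $caught = false;
} catch (\RuntimeException $exception) {
    $caught = true;
}
```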
The current CSV extractor does not seem to work with entries that span more than one line (multiline entries).
Needs a documentation update.
Is it possible to add constants to the CSV extract, i.e. 'USER' => '123'?
Earlier versions of MariaDB were more forgiving, but now I'm getting a syntax error because I have a column that is a reserved word. Oops. The quick fix is to change the column name, but I thought I'd bring it up here so that php-etl could put backticks around column names when creating queries. That would avoid the error.
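A sketch of backtick quoting for MySQL/MariaDB identifiers (quoteIdentifier() is a hypothetical helper; doubling embedded backticks keeps the quoting safe even for odd column names):

```php
/**
 * Wrap an identifier in backticks so reserved words like `order`
 * survive query building; embedded backticks are doubled.
 */
function quoteIdentifier(string $name): string
{
    return '`' . str_replace('`', '``', $name) . '`';
}

// 'order' is a reserved word, but quoted it is a perfectly valid column name.
$columns = implode(', ', array_map('quoteIdentifier', ['id', 'order']));
$sql = "SELECT {$columns} FROM " . quoteIdentifier('users');
```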
So far these tests don't test anything, as the Rows in the $expected array are strictly the same references to the Row objects as in $data. $expected needs to be filled with proper new Rows to make any meaningful equality assertion.
Following @ArthurHoaro's comment:
I didn't really take a deep look at the implementation, but it looks like that there is a lot of code for SQL query building.
As a further improvement, maybe it would make sense to use a proper SQL query builder as a dependency rather than implementing one in this lib.
Development experience with this library has shown redundant usage of string keys, for instance 'columns', 'delimiter'...
The idea would be to define and use constants for those keys.