zhongjiajie / datax
This project is forked from alibaba/datax (a fork based on alibaba/DataX).
License: Other
Hi, how can I use DataX to read data in Parquet format?
After building from source, I went into target/datax/datax and ran `python bin/datax.py job/job.json`, which failed with the following error:
2020-09-25 20:27:24.531 [main] INFO ErrorRecordChecker - percentage使用标准的百分比(配置值忽略百分号),如 [45.45%] 的配置为:"percentage": 45.45
2020-09-25 20:27:24.532 [main] INFO ErrorRecordChecker - 配置了 errorLimit.record, 其优先级高于 errorLimit.percentage 会将覆盖 errorLimit.percentage
2020-09-25 20:27:24.533 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2020-09-25 20:27:24.534 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2020-09-25 20:27:24.535 [main] INFO JobContainer - DataX jobContainer starts job.
2020-09-25 20:27:24.537 [main] INFO JobContainer - Set jobId = 0
2020-09-25 20:27:24.559 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2020-09-25 20:27:24.560 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do prepare work .
2020-09-25 20:27:24.561 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2020-09-25 20:27:24.561 [job-0] INFO JobContainer - jobContainer starts to do split ...
2020-09-25 20:27:24.568 [job-0] ERROR JobContainer - Exception when job run
com.alibaba.datax.common.exception.DataXException: Code:[Framework-03], Description:[DataX引擎配置错误,该问题通常是由于DataX安装错误引起,请联系您的运维解决 .]. - 在有总bps限速条件下,单个channel的bps值不能为空,也不能为非正数
at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:26) ~[datax-common-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.job.JobContainer.adjustChannelNumber(JobContainer.java:430) ~[datax-core-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.job.JobContainer.split(JobContainer.java:387) ~[datax-core-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.job.JobContainer.start(JobContainer.java:117) ~[datax-core-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.Engine.start(Engine.java:92) [datax-core-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.Engine.entry(Engine.java:171) [datax-core-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.Engine.main(Engine.java:204) [datax-core-0.0.1-SNAPSHOT.jar:na]
2020-09-25 20:27:24.576 [job-0] INFO StandAloneJobContainerCommunicator - Total 0 records, 0 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 0.00%
2020-09-25 20:27:24.577 [job-0] ERROR Engine -
经DataX智能分析,该任务最可能的错误原因是:
com.alibaba.datax.common.exception.DataXException: Code:[Framework-03], Description:[DataX引擎配置错误,该问题通常是由于DataX安装错误引起,请联系您的运维解决 .]. - 在有总bps限速条件下,单个channel的bps值不能为空,也不能为非正数
at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:26)
at com.alibaba.datax.core.job.JobContainer.adjustChannelNumber(JobContainer.java:430)
at com.alibaba.datax.core.job.JobContainer.split(JobContainer.java:387)
at com.alibaba.datax.core.job.JobContainer.start(JobContainer.java:117)
at com.alibaba.datax.core.Engine.start(Engine.java:92)
at com.alibaba.datax.core.Engine.entry(Engine.java:171)
at com.alibaba.datax.core.Engine.main(Engine.java:204)
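The Framework-03 message above translates to: "when a total bps limit is set, the per-channel bps value must not be empty or non-positive." In JobContainer.adjustChannelNumber, the job-level byte limit (job.setting.speed.byte) is divided by the per-channel byte limit (core.transport.channel.speed.byte) to derive the channel count, so the latter must also be a positive number. A minimal sketch of the two settings together (the byte values here are illustrative assumptions, not recommendations):

```json
{
  "core": {
    "transport": {
      "channel": {
        "speed": { "byte": 1048576 }
      }
    }
  },
  "job": {
    "setting": {
      "speed": { "byte": 5242880 }
    }
  }
}
```

Alternatively, remove job.setting.speed.byte entirely (for example, limit concurrency with job.setting.speed.channel instead) so that no total bps limit applies.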
Thanks!
Output:
2020-12-17 19:52:52.083 [0-0-0-writer] ERROR StdoutPluginCollector - 脏数据:
{"message":"status:[400], error: {\"type\":\"mapper_parsing_exception\",\"reason\":\"failed to parse field [created_at] of type [date] in document with id '11'. Preview of field's value: '2020-10-28T11:16:06.000+08:00'\",\"caused_by\":{\"type\":\"illegal_argument_exception\",\"reason\":\"failed to parse date field [2020-10-28T11:16:06.000+08:00] with format [yyyy-MM-dd HH:mm:ss]\",\"caused_by\":{\"type\":\"date_time_parse_exception\",\"reason\":\"Text '2020-10-28T11:16:06.000+08:00' could not be parsed at index 10\"}}}","record":[{"byteSize":2,"index":0,"rawData":11,"type":"LONG"},{"byteSize":7,"index":1,"rawData":"手机端淘宝首页","type":"STRING"},{"byteSize":5,"index":2,"rawData":15369,"type":"LONG"},{"byteSize":4,"index":3,"rawData":2929,"type":"LONG"},{"byteSize":8,"index":4,"rawData":1603854966000,"type":"DATE"}],"type":"writer"}
2020-12-17 19:52:52.085 [0-0-0-writer] ERROR StdoutPluginCollector - 脏数据:
{"message":"status:[400], error: {\"type\":\"mapper_parsing_exception\",\"reason\":\"failed to parse field [created_at] of type [date] in document with id '16'. Preview of field's value: '2020-10-28T11:16:06.000+08:00'\",\"caused_by\":{\"type\":\"illegal_argument_exception\",\"reason\":\"failed to parse date field [2020-10-28T11:16:06.000+08:00] with format [yyyy-MM-dd HH:mm:ss]\",\"caused_by\":{\"type\":\"date_time_parse_exception\",\"reason\":\"Text '2020-10-28T11:16:06.000+08:00' could not be parsed at index 10\"}}}","record":[{"byteSize":2,"index":0,"rawData":16,"type":"LONG"},{"byteSize":4,"index":1,"rawData":"过年海报","type":"STRING"},{"byteSize":6,"index":2,"rawData":104630,"type":"LONG"},{"byteSize":4,"index":3,"rawData":5888,"type":"LONG"},{"byteSize":8,"index":4,"rawData":1603854966000,"type":"DATE"}],"type":"writer"}
2020-12-17 19:52:52.086 [0-0-0-writer] ERROR StdoutPluginCollector - 脏数据:
Configuration:
"reader": {
    "name": "mysqlreader",
    "parameter": {
        "username": "",
        "password": "",
        "connection": [
            {
                "querySql": [
                    "select id,keyword,scount,icount,created_at from t"
                ],
                "jdbcUrl": [
                    ""
                ]
            }
        ]
    }
},
"writer": {
    "name": "elasticsearchwriter",
    "parameter": {
        "endpoint": "http://127.0.0.1:9200",
        "index": "aaa",
        "type": "_doc",
        "cleanup": false,
        "discovery": false,
        "batchSize": 1000,
        "splitter": ",",
        "dynamic": true,
        "column": [
            {"name": "id", "type": "id"},
            {"name": "keyword", "type": "text", "analyzer": "ccc"},
            {"name": "scount", "type": "integer"},
            {"name": "icount", "type": "integer"},
            {"name": "created_at", "type": "date", "fromFormat": "yyyy-MM-dd HH:mm:ss"}
        ]
    }
}
ES index mapping:
"mappings": {
    "properties": {
        "icount": {
            "type": "integer"
        },
        "scount": {
            "type": "integer"
        },
        "created_at": {
            "format": "yyyy-MM-dd HH:mm:ss",
            "type": "date"
        },
        "id": {
            "type": "integer"
        },
        "keyword": {
            "analyzer": "ccc",
            "type": "text"
        }
    }
}
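The dirty-data errors trace back to a format mismatch: the mapping above restricts created_at to yyyy-MM-dd HH:mm:ss, while the writer serializes the DATE column as an ISO-8601 string (2020-10-28T11:16:06.000+08:00, note the T at index 10). One possible fix, assuming you can recreate or reindex the index (Elasticsearch does not allow changing the format of an existing date field in place), is to let the field accept several formats with the || separator:

```json
"created_at": {
    "type": "date",
    "format": "yyyy-MM-dd HH:mm:ss||strict_date_optional_time||epoch_millis"
}
```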
Sample data:
insert into t ( `icount`, `pinyin`, `scount`, `created_at`, `keyword`, `updated_at`) values ( '253300', 'beijing', '14432285', '2020-10-28 11:16:06', '背景', '2020-10-28 11:16:06');
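The "could not be parsed at index 10" failure in the log above can be reproduced outside DataX. The pattern yyyy-MM-dd HH:mm:ss expects a space at index 10, but the value the writer sends has a T there. A small Python check (illustrative only; the Java DateTimeFormatter behaves the same way for this pattern):

```python
from datetime import datetime

# The value the writer sends, as shown in the dirty-data log above.
value = "2020-10-28T11:16:06.000+08:00"

# The ES mapping format "yyyy-MM-dd HH:mm:ss" corresponds to Python's
# "%Y-%m-%d %H:%M:%S": it expects a space at index 10, but the writer
# emits an ISO-8601 string with a 'T' there, so parsing fails.
try:
    datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    parsed = True
except ValueError:
    parsed = False

# The same string is a perfectly valid ISO-8601 timestamp.
iso = datetime.fromisoformat(value)
```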
You just closed this issue, but it is the same problem: elasticsearchwriter converts a string field in yyyy-MM-dd format to date, yet some of the synced values are empty; an empty value cannot be written as a date, so those records end up as dirty data.
Currently the default dirty-data collection uses StdoutPluginCollector, which prints dirty records to the terminal. The new LocalFilePluginCollector prints to the terminal and, at the same time, writes the dirty records to a local file; the file path should follow the log file path.
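DataX's collectors are Java, but the dual-sink behaviour described above can be sketched in a few lines of Python (the class name, record shape, and path handling here are assumptions for illustration, not DataX's actual API):

```python
import json
import sys
from pathlib import Path

# Illustrative sketch: mirror what a LocalFilePluginCollector-style
# collector does -- echo each dirty record to the terminal AND append
# it to a local file next to the job log.
class LocalFileDirtyCollector:
    def __init__(self, path):
        self.path = Path(path)

    def collect(self, record, message):
        line = json.dumps({"message": message, "record": record},
                          ensure_ascii=False)
        # Same visible behaviour as StdoutPluginCollector:
        print("dirty data: " + line, file=sys.stderr)
        # Additionally persist the record locally:
        with self.path.open("a", encoding="utf-8") as f:
            f.write(line + "\n")
```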