higee / elastic


A project for building dashboards with the Elastic Stack (6.2.4)

Python 100.00%

elastic logstash kibana elasticsearch elasticstack dashboard beats elk elk-stack


elastic's Issues

What happens when a file's contents change while using the logstash file input plugin?

Test environment setup

Suppose we have a date.log file like the following.

2018-02-12T16:03:38 Ben
2018-02-13T03:25:31 John
2018-02-14T13:31:11 Leo

Run logstash with the following test.conf.

input {
  file {
    path => "/usr/share/logstash/data/date.log"
    start_position => "beginning"
    sincedb_path => "/usr/share/logstash/test.db"
  }
}

filter {
  mutate {
    remove_field => ["@version", "host", "path", "@timestamp"]
  }
}

output {
  stdout {
    codec => rubydebug 
  }
}

This produces the following output.

{
       "message" => "2018-02-12T16:03:38 Ben"
}
{
       "message" => "2018-02-13T03:25:31 John"
}
{
       "message" => "2018-02-14T13:31:11 Leo"
}

Now look inside test.db, the file we configured as the sincedb.

$ cat test.db
>>>
2331699 0 51713 73

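The last value, 73, is the current_byte_offset: it records how far into the file logstash has read.
Note that this is not a key-value record of what was read. It is only a position, so the result depends on how date.log is changed. Modify date.log in each of the ways below and check the logstash output.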

Tests

1. Append a line at the end

Edit date.log as follows

2018-02-12T16:03:38 Ben
2018-02-13T03:25:31 John
2018-02-14T13:31:11 Leo
2018-02-15T23:31:11 Tom

Then run it
$ bin/logstash -f test.conf
The result is as follows : as expected, only the newly added line is printed

{
    "message" => "2018-02-15T23:31:11 Tom"
}

Check the sincedb : only the fourth field has changed

$ cat test.db 
>>>
2331699 0 51713 97

Interpretation : appending new data at the end is the most common use case, so it works as expected

2. Insert a line at the top

First restore test.db to its original state as follows

$ echo '2331699 0 51713 73' > test.db
$ cat test.db
>>>
2331699 0 51713 73

Edit date.log as follows

2018-02-11T16:03:38 Jay
2018-02-12T16:03:38 Ben
2018-02-13T03:25:31 John
2018-02-14T13:31:11 Leo

Then run it
$ bin/logstash -f test.conf
The result is as follows

{
    "message" => "2018-02-14T13:31:11 Leo"
}

Check the sincedb

$ cat test.db
>>>
2331699 0 51713 97

Interpretation : as shown below, the sincedb remembers a position rather than the data itself, so the output is not what we intended

File before the test

2018-02-12T16:03:38 Ben
2018-02-13T03:25:31 John
2018-02-14T13:31:11 Leo       # <<< recorded as read through the end of the third line

File modified for the test

2018-02-11T16:03:38 Jay     # <<< newly added line
2018-02-12T16:03:38 Ben
2018-02-13T03:25:31 John  # <<< the sincedb offset now points here
2018-02-14T13:31:11 Leo

3.1 Modify the last line (same length)

First restore test.db to its original state as follows

$ echo '2331699 0 51713 73' > test.db
$ cat test.db
>>>
2331699 0 51713 73

Edit date.log as follows

2018-02-12T16:03:38 Ben
2018-02-13T03:25:31 John
2018-02-14T13:31:11 NEW

Then run it
$ bin/logstash -f test.conf
The result is as follows : nothing is printed

Check the sincedb

$ cat test.db
>>>
2331699 0 51713 73

Interpretation : the content changed, but by the same number of characters, so the file size is unchanged and nothing is read again

3.2 Modify the last line (longer)

First restore test.db to its original state as follows

$ echo '2331699 0 51713 73' > test.db
$ cat test.db
>>>
2331699 0 51713 73

Edit date.log as follows

2018-02-12T16:03:38 Ben
2018-02-13T03:25:31 John
2018-02-14T13:31:11 NEWDATA

Then run it
$ bin/logstash -f test.conf
The result is as follows

{
    "message" => "ATA"
}

Check the sincedb

$ cat test.db
>>>
2331699 0 51713 77

Interpretation : the content changed and the line is longer than before, so only the bytes after the previously recorded offset are printed
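
The four tests can be reproduced with a small simulation. This is only a sketch of the byte-offset idea, not logstash's actual implementation: `read_new_events` is a hypothetical helper, and it assumes every line in date.log ends with a newline.

```python
def read_new_events(content: bytes, offset: int):
    """Return (events, new_offset) for the data found after a saved byte offset."""
    if len(content) <= offset:
        # Nothing lies beyond the recorded position, so nothing is emitted.
        return [], offset
    tail = content[offset:]
    events = [line.decode("utf-8") for line in tail.splitlines()]
    return events, len(content)


original = (
    b"2018-02-12T16:03:38 Ben\n"
    b"2018-02-13T03:25:31 John\n"
    b"2018-02-14T13:31:11 Leo\n"
)
offset = len(original)  # the fourth field stored in test.db

appended = original + b"2018-02-15T23:31:11 Tom\n"       # test 1
prepended = b"2018-02-11T16:03:38 Jay\n" + original      # test 2
same_length = original.replace(b"Leo", b"NEW")           # test 3.1
longer = original.replace(b"Leo", b"NEWDATA")            # test 3.2

for name, content in [("append", appended), ("prepend", prepended),
                      ("same length", same_length), ("longer", longer)]:
    print(name, read_new_events(content, offset))
```

Because only the offset is compared with the file size, an edit that leaves the size unchanged is invisible, and an edit before the offset shifts which bytes count as new.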

When is a scripted field executed?

Basically, a scripted field is executed when it is used. However, the behavior differs depending on how the script accesses the field value.

doc['my_field'].value

The data to be used is loaded into memory, so it is cached
Fast
Memory usage increases

params['_source']['my_field']

The source has to be read and parsed every time it is used
Slow

Source

elastic reference - script fields

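What is the sort order when using Order by Terms in a terms aggregation?

~~Terms are sorted by unicode value~~

That is my guess.
I will revise this once I have confirmed it.

Assuming the sort key is the `unicode value`

Unicode names and values are as follows

Unicode Value (Range)	Unicode Name
U+1100 - U+11FF	Hangul (Jamo block)
U+0061 - U+007A	English lowercase letters
U+0041 - U+005A	English uppercase letters
U+0030 - U+0039	Digits

Suppose the field 구매사이트 is mapped as follows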

PUT shopping
{
  "mappings": {
    "shopping": {
      "properties": {
        "구매사이트": {
          "type": "keyword"
        }
      }
    }
  }
}

Now sort ascending by the terms of the 구매사이트 field

The results show the terms in order, starting from the smallest Unicode value
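
The ordering this section hypothesizes can be previewed offline. The sketch below does not query elasticsearch; it only sorts strings by code point, which is how Python compares `str` values. The site names are made-up sample terms.

```python
def sort_by_code_point(terms):
    # Python compares str values code point by code point.
    return sorted(terms)


terms = ["쿠팡", "amazon", "Gmarket", "11번가"]  # hypothetical sample terms
for term in sort_by_code_point(terms):
    print(term, [f"U+{ord(ch):04X}" for ch in term])
```

Under this assumption digits come first, then uppercase letters, then lowercase letters, then Hangul.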

Source

Does the order matter when using multiple series in timelion?

For convenience, start with the following default settings.

index=nginx-*
timefield=@timestamp

default timelion

.es()

Scenario 1

.es().label(20이하), 
.es().if(gte, 20, .es(), null).label(20이상)

Scenario 2

.es().if(gte, 20, .es(), null).label(20이상), 
.es().label(20이하)

Why does this happen?

series 1 is plotted
series 2 is plotted (where the two overlap, the later series is drawn over the earlier one)
Example
- series1 of Scenario 1
- series2 of Scenario 1

What are the features and pricing of x-pack?

The main features are as follows

Security
Monitoring
Alerting
Machine Learning
Graph analysis and visualization
Reporting

Detailed and accurate information on features and pricing is available on the official website.

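Can dates with microsecond precision be used?

They can be indexed, but operations such as sorting only work at millisecond precision

Date type formats available in elasticsearch 6.2

Looking closely, the finest unit is the millisecond
So does that mean formats finer than a millisecond, such as microseconds, cannot be used?
For example, suppose we index 2018-07-05 04:24:56.170243

Method 1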

mapping

PUT default-format
{
  "mappings": {
    "_doc": {
      "properties": {
        "@timestamp" : {
          "type" : "date"
        }
      }
    }
  }
}

indexing

POST default-format/_doc
{
  "@timestamp" : "2018-07-05 04:24:56.170243"
}

result : a format error occurs and the document is not indexed

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [@timestamp]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse [@timestamp]",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Invalid format: \"2018-07-05 04:24:56.170243\" is malformed at \" 04:24:56.170243\""
    }
  },
  "status": 400
}

Method 2

mapping

PUT customized-format
{
  "mappings": {
    "_doc": {
      "properties": {
        "@timestamp" : {
          "type" : "date",
          "format" : "yyyy-MM-dd HH:mm:ss.SSSSSS"
        }
      }
    }
  }
}

indexing

POST customized-format/_doc
{
  "@timestamp" : "2018-07-05 04:24:56.170243"
}

result : indexing succeeds

{
  "_index": "customized-format",
  "_type": "_doc",
  "_id": "55TFZmQByNsCKuKnb22Y",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

Caveats

elasticsearch only supports resolution down to the millisecond by default
So Method 2 above lets you index the value, but comparing at microsecond granularity is not possible
Example : index the following two Documents

POST customized-format/_doc
{
  "@timestamp" : "2018-07-05 04:24:56.170243"
}

POST customized-format/_doc
{
  "@timestamp" : "2018-07-05 04:24:56.170244"
}

Verification

Kibana : values below the millisecond cannot be distinguished, so the two Documents cannot be ordered by time

Elasticsearch : even with sort, values cannot be compared below the millisecond

query

GET customized-format/_search
{
  "query": {
    "match_all": {
    }
  },
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ]
}

result : the two Documents have the same sort value

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": null,
    "hits": [
      {
        "_index": "customized-format",
        "_type": "_doc",
        "_id": "CZTJZmQByNsCKuKnu25J",
        "_score": null,
        "_source": {
          "@timestamp": "2018-07-05 04:24:56.170244"
        },
        "sort": [
          1530764696170
        ]
      },
      {
        "_index": "customized-format",
        "_type": "_doc",
        "_id": "B5TJZmQByNsCKuKnsG6f",
        "_score": null,
        "_source": {
          "@timestamp": "2018-07-05 04:24:56.170243"
        },
        "sort": [
          1530764696170
        ]
      }
    ]
  }
}
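
The identical sort values can be reproduced by hand. The sketch below is not elasticsearch code; it parses both timestamps as UTC (an assumption that matches the sort value shown above) and truncates them to epoch milliseconds. `to_epoch_millis` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(text: str) -> int:
    """Parse 'yyyy-MM-dd HH:mm:ss.SSSSSS' as UTC and truncate to milliseconds."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")
    parsed = parsed.replace(tzinfo=timezone.utc)
    # Integer division drops everything below one millisecond.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


first = to_epoch_millis("2018-07-05 04:24:56.170243")
second = to_epoch_millis("2018-07-05 04:24:56.170244")
print(first, second, first == second)
```

If microsecond ordering matters, one option is to index the sub-millisecond part in a separate numeric field and use it as a secondary sort key.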

Source

Can comments be placed inside an API request?

Yes

Kibana (Dev Tools)

API

Method 1

GET nginx-*/_search
{
  "query": {
    "match_all": {} /* fetch all Documents */
  }
}

Method 2

GET nginx-*/_search
{
  "query": {
    "match_all": {} // fetch all Documents
  }
}

Result

cURL

API

$ curl -XGET "http://localhost:9200/nginx-*/_search" -H 'Content-Type: application/json' -d'
{
  "size": 1,
  "query": {
    "match_all": {} /* fetch all Documents */
  }
}' | jq

Result

How do you match whitespace with Lucene Query Syntax?

Suppose we want to use Lucene Query Syntax to find Documents that start with a specific phrase.

Field : nginx.access.geoip.country_name
Condition : Documents starting with "Republic of"

How can this be done?

~~Option 1~~

Lucene Query Syntax : nginx.access.geoip.country_name:Republic*of*
Issues
- It matches every Document starting with Republic of
- It also matches Documents starting with Republic_of, Republicof, and so on

Option 2

Lucene Query Syntax : nginx.access.geoip.country_name:Republic\ of*
Issues : none. Only Documents starting with exactly Republic of are matched, as intended

How can a terms aggregation be sorted case-insensitively?

Use a normalizer

Suppose we have the following index

PUT test
{
  "mappings": {
    "test": {
      "properties": {
        "country": {
          "type": "keyword"
        }
      }
    }
  }
}

Index the following data into the country field

PUT test/test/1
{
  "country" : "China"
}

PUT test/test/2
{
  "country" : "chile"
}

Sort by country

For the same reason as in #22, China comes first and chile second.
To get case-insensitive results (here, ignoring upper and lower case), add a normalizer as follows

PUT test2
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "test2": {
      "properties": {
        "country": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}

Index the data again

PUT test2/test2/1
{
  "country" : "China"
}

PUT test2/test2/2
{
  "country" : "chile"
}

Check the result

Note, however, that values such as China and china are merged into a single term
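
The effect of the normalizer on ordering and on merging can be sketched without a cluster. The `normalize` function below stands in for the `lowercase` filter only (`asciifolding` is not modeled), and the extra value "china" is added for illustration.

```python
from collections import Counter


def normalize(term: str) -> str:
    # Stand-in for the "lowercase" filter only; asciifolding is not modeled.
    return term.lower()


countries = ["China", "chile", "china"]

case_sensitive = sorted(countries)
normalized_buckets = sorted(Counter(normalize(c) for c in countries).items())

print(case_sensitive)
print(normalized_buckets)
```

The second result has two buckets instead of three, which is the merging the note above warns about.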

Source

How does the exists query handle a null field value?

It returns Documents that have at least one non-null value in the queried field

Example

Suppose we have the following Documents

{ "user": "jane" }
{ "user": "" } 
{ "user": ["jane"] }
{ "user": ["jane", null ] } 
{ "user": null }
{ "user": [] } 
{ "user": [null] } 
{ "foo":  "bar" }

And suppose we search with the following query

GET /_search
{
  "query": {
    "exists" : { 
      "field" : "user" 
    }
  }
}

The result is as follows

document	exists	missing	note
`{ "user": "jane" }`	o		the non-null value "jane" exists
`{ "user": "" }`	o		an empty string is a non-null value
`{ "user": ["jane"] }`	o		the non-null value "jane" exists
`{ "user": ["jane", null] }`	o		the non-null value "jane" exists
`{ "user": null}`		o	only null exists
`{ "user": [] }`		o	there are no values, so it counts as null
`{ "user": [null] }`		o	only null exists
`{ "foo": "bar" }`		o	the "user" field does not exist at all

However, a null_value mapping can change the results above
- mapping example
```
PUT test_index
{
  "mappings": {
    "test_type": {
      "properties": {
        "user": {
          "type": "keyword",
          "null_value": "NULL"
        }
      }
    }
  }
}
```
- Meaning : a value explicitly indexed as null is replaced with the string NULL
- Result of the same query as above

document	exists	missing	note
`{ "user": "jane" }`	o		the non-null value "jane" exists
`{ "user": "" }`	o		an empty string is a non-null value
`{ "user": ["jane"] }`	o		the non-null value "jane" exists
`{ "user": ["jane", null] }`	o		the non-null value "jane" exists
`{ "user": null}`	o		null is replaced with `NULL`, so a non-null value exists
`{ "user": [] }`		o	nothing was explicitly indexed as null, so it counts as null
`{ "user": [null] }`	o		null is replaced with `NULL`, so a non-null value exists
`{ "foo": "bar" }`		o	the "user" field does not exist at all
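
The rule behind both tables can be written as a small predicate. This is a sketch of the semantics described above, not elasticsearch code; `exists` and its `null_value` parameter are hypothetical names chosen to mirror the mapping option.

```python
def exists(doc: dict, field: str, null_value=None) -> bool:
    """True when the field holds at least one non-null value."""
    if field not in doc:
        return False
    value = doc[field]
    values = value if isinstance(value, list) else [value]
    if null_value is not None:
        # Explicit nulls are replaced, as the null_value mapping does.
        values = [null_value if v is None else v for v in values]
    return any(v is not None for v in values)


docs = [
    {"user": "jane"}, {"user": ""}, {"user": ["jane"]}, {"user": ["jane", None]},
    {"user": None}, {"user": []}, {"user": [None]}, {"foo": "bar"},
]
print([exists(d, "user") for d in docs])
print([exists(d, "user", null_value="NULL") for d in docs])
```

An empty array never matches in either case, because there is no explicit null for `null_value` to replace.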

Whether a null was indexed explicitly can be checked with the query below
- Index data into the test_index created above
  - explicit null
```
PUT test_index/test_type/1
{
"user": null
}
```
  - non-explicit null
```
PUT test_index/test_type/2
{
  "user": []
}
```
- Check all Documents

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "user": []
        }
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "user": null
        }
      }
    ]
  }
}

Search

query

GET test_index/_search
{
  "query": {
    "term": {
      "user": "NULL" 
    }
  }
}

Result : only the document with _id 1 is returned

 {
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "user": null
        }
      }
    ]
  }
}

source

Can the password be hidden in the logstash configuration when using the jdbc input?

Yes

Method 1 : use a password file (jdbc_password_filepath)

password file

$ cat /usr/share/logstash/password
>>>
fc

logstash configuration

input {
  jdbc {
    jdbc_validate_connection => true
    jdbc_connection_string => "jdbc:mysql://52.78.134.20:3306/fc"
    jdbc_user => "fc"
    jdbc_password_filepath => "/usr/share/logstash/password"
    jdbc_driver_library => "/usr/share/logstash/driver/mysql-connector-java-5.1.36-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    statement => "SELECT * FROM test"
 }
}

output {
  stdout {
  }
}

Method 2 : use an environment variable

Set the environment variable
```
$ export password=fc
```

logstash configuration

 input {
   jdbc {
     jdbc_validate_connection => true
     jdbc_connection_string => "jdbc:mysql://52.78.134.20:3306/fc"
     jdbc_user => "fc"
     jdbc_password => "${password}"
     jdbc_driver_library => "/usr/share/logstash/driver/mysql-connector-java-5.1.36-bin.jar"
     jdbc_driver_class => "com.mysql.jdbc.Driver"
     statement => "SELECT * FROM test"
  }
 }

 output {
   stdout {
   }
 }

Method 3 : use the Logstash Keystore (recommended)

Set the password used to access the Keystore
```
$ export LOGSTASH_KEYSTORE_PASS=higee
```

Create the Keystore

$ bin/logstash-keystore create
>>>
Created Logstash keystore at /usr/share/logstash/config/logstash.keystore

Add an entry to the Keystore

$ bin/logstash-keystore add jdbc_password
>>>
Enter value for jdbc_password: # type the password here and press Enter
Added 'jdbc_password' to the Logstash keystore.

logstash configuration

 input {
   jdbc {
     jdbc_validate_connection => true
     jdbc_connection_string => "jdbc:mysql://52.78.134.20:3306/fc"
     jdbc_user => "fc"
     jdbc_password => "${jdbc_password}"
     jdbc_driver_library => "/usr/share/logstash/driver/mysql-connector-java-5.1.36-bin.jar"
     jdbc_driver_class => "com.mysql.jdbc.Driver"
     statement => "SELECT * FROM test"
  }
 }

 output {
   stdout {
   }
 }

Source

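With the file input, if a file is deleted and recreated under the same name, is it treated as a different file?

It is treated as the same file only when all three of the following are identical.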

The inode number (or equivalent)
The major device number of the file system (or equivalent)
The minor device number of the file system (or equivalent)

In general the major/minor device numbers of the file system stay constant, so it is enough to focus on the inode

The inode number (or equivalent)

On a linux file system, a newly created file is stored as the following parts
- filename : the name of the created file
- inode : file metadata (size, link count, timestamp, ...)
- data : the actual data in the file
Create a file as below and look at its inode
```
$ touch test1
$ ls -i test1
>>>
14434 test1
```
Delete the file, recreate it under the same name, and look at the inode
```
$ rm test1
$ touch test1
$ ls -i test1
>>>
14434 test1
```
~~Delete the file (test1), create a different file (test2), then recreate the original name (test1) and look at the inode~~
```
$ rm test1
$ touch test2
$ ls -i test2
>>>
14434 test2

$ touch test1
$ ls -i test1
>>>
14437 test1
```

The major/minor device number of the file system (or equivalent)

device

In the filesystem, devices are accessed through device nodes

Device nodes are usually located under /dev

$ ls -lah /dev
>>>
crw-rw-rw- 1 root   root    1, 3   Feb 23 1999  null   # major device number : 1 minor device number : 3
crw-rw-rw- 1 root   root    1, 5   Feb 23 1999  zero   # major device number : 1 minor device number : 5
crw------- 1 rubini tty     4, 1   Aug 16 22:22 tty1   # major device number : 4 minor device number : 1
crw-rw-rw- 1 root   dialout 4, 64  Jun 30 11:19 ttyS0  # major device number : 4 minor device number : 64
crw-rw-rw- 1 root   dialout 4, 65  Aug 16 00:00 ttyS1  # major device number : 4 minor device number : 65

device driver
- A single device driver (usually) manages several devices
- Types
  - character device driver (c) : terminal, keyboard, sound card, ...
  - block device driver (b) : hard disk, RAM, CD-ROM, ...
major device number : identifies a particular device driver
minor device number : identifies which device handled by that driver is being used
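
The three values can also be read programmatically. The sketch below uses Python's standard `os.stat`, `os.major`, and `os.minor` (Unix only); `file_identity` is a hypothetical helper, and whether the inode is reused after recreation depends on the file system, so no particular output is assumed.

```python
import os
import tempfile


def file_identity(path: str):
    """Return (inode, major device number, minor device number) for a path."""
    info = os.stat(path)
    return info.st_ino, os.major(info.st_dev), os.minor(info.st_dev)


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "test1")
    open(path, "w").close()
    before = file_identity(path)

    # Delete the file and recreate it under the same name.
    os.remove(path)
    open(path, "w").close()
    after = file_identity(path)

print(before, after, before == after)
```

When the two tuples are equal, a position-based tracker that keys on them has no way to tell the recreated file from the original.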

Source

Can previous versions of a Document be retrieved?

Only the latest version can be retrieved; values from earlier versions cannot

Index the data

PUT test/test/1
{
  "name" : "higee"
}

Retrieve the Document

GET test/test/1
>>>
{
  "_index": "test",
  "_type": "test",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "name": "higee"
  }
}

Update the data

PUT test/test/1
{
  "name" : "higee/elastic"
}

Retrieve the Document

GET test/test/1
>>>
{
  "_index": "test",
  "_type": "test",
  "_id": "1",
  "_version": 2,
  "found": true,
  "_source": {
    "name": "higee/elastic"
  }
}

Source

Elasticsearch Reference - GET API

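Why are values not displayed in timelion even though they exist?

Because the non-null values are not consecutive

Goal

Draw a line graph that is green where the documents count in an interval is 20 or more and pink where it is below 20

Approach 1

Method : draw the whole graph, then draw the values with a documents count of 20 or more on top in a different color
expression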

.es().label(20이하), 
.es().if(gte, 20, .es(), null).label(20이상)

timelion

Approach 2

Method : draw one graph for documents counts below 20 and a separate graph for counts of 20 or more
expression

.es().if(lt, 20, .es(), null).label(20이하),
.es().if(gte, 20, .es(), null).label(20이상)

timelion

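Why do gaps appear in the middle?
- Nulls were generated by the condition, and each null removes a point that the line would otherwise connect
- To avoid a break, non-null values must be consecutive (at least 2 in a row)
- Look around the values that appear broken in the Approach 2 timelion
  - a particular point in the broken segment (point A) : a value actually exists there
  - just before point A
  - just after point A
- Changing the Approach 2 expression as below makes this even clearer
```
.es().if(lt, 20, .es(), 0).label(20이하),
.es().if(gte, 20, .es(), 0).label(20이상)
```
- timelion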
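
The gap behavior can be reasoned about with plain lists. This is a sketch of the idea only, not timelion's rendering code; the counts are made-up sample data and all three helper names are hypothetical.

```python
def split_by_threshold(counts, threshold=20):
    """Mimic the two .if() expressions: keep a value or replace it with None."""
    below = [c if c < threshold else None for c in counts]
    at_or_above = [c if c >= threshold else None for c in counts]
    return below, at_or_above


def line_segments(series):
    """Index pairs (i, i + 1) that a line chart can connect."""
    return [(i, i + 1) for i in range(len(series) - 1)
            if series[i] is not None and series[i + 1] is not None]


def isolated_points(series):
    """Indexes of non-null values that have no non-null neighbour."""
    result = []
    for i, value in enumerate(series):
        if value is None:
            continue
        has_left = i > 0 and series[i - 1] is not None
        has_right = i < len(series) - 1 and series[i + 1] is not None
        if not has_left and not has_right:
            result.append(i)
    return result


counts = [25, 30, 12, 27, 31, 8, 9]  # made-up documents counts per interval
below, at_or_above = split_by_threshold(counts)
print(line_segments(below), isolated_points(below))
print(line_segments(at_or_above), isolated_points(at_or_above))
```

The value 12 at index 2 exists in the data, but it has a null on both sides, so a line-only chart has nothing to connect it to.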

How to write a Term Query

When using the Term Query you will notice that there are two forms.

Short form

GET shopping/_search
{
  "query": {
    "term": {
      "상품분류": "셔츠"
    }
  }
}

Expanded form

GET shopping/_search
{
  "query": {
    "term": {
      "상품분류": {
        "value": "셔츠"
      }
    }
  }
}

Both return the same result. The expanded form exists so that parameters can be used together with the term query. For example, consider the query below.

GET shopping/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "상품분류": {
              "value": "셔츠",
              "boost": 2.0 
            }
          }
        },
        {
          "term": {
            "고객주소_시도": "경상남도" 
          }
        }
      ]
    }
  }
}

As shown, the expanded form lets you combine a term query with parameters such as boost.

What is the default value of minimum_should_match in a bool query?

When used together with must or must_not : 0
When should is used alone, without must or must_not : 1

1. Used together with must

Without explicitly setting minimum_should_match : 1806 hits

GET shopping/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "결제카드": "우리"
          }
        }
      ],
      "should": [
        {
          "match": {
            "고객성별": "남성"
          }
        }
      ]
    }
  }
}

With minimum_should_match set to 0 : 1806 hits

GET shopping/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "결제카드": "우리"
          }
        }
      ],
      "should": [
        {
          "match": {
            "고객성별": "남성"
          }
        }
      ],
      "minimum_should_match": 0
    }
  }
}

2. Using `should` alone, without `must` or `must_not`

Without explicitly setting minimum_should_match : 2714 hits

GET shopping/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "고객성별": "남성"
          }
        }
      ]
    }
  }
}

With minimum_should_match explicitly set to 1 : 2714 hits

GET shopping/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "고객성별": "남성"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}
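
The default described in this section can be expressed as a small predicate. This is a sketch that models only the `must` and `should` cases tested above (it says nothing about `must_not` or `filter`); `bool_matches` is a hypothetical helper, and each list holds one boolean per clause telling whether that clause matched a given document.

```python
def bool_matches(must, should, minimum_should_match=None):
    """Decide whether one document matches a bool query."""
    if minimum_should_match is None:
        # Default per this section: 1 for should-only queries, otherwise 0.
        minimum_should_match = 1 if should and not must else 0
    return all(must) and sum(should) >= minimum_should_match


# must matches, should does not: the document still matches by default.
print(bool_matches(must=[True], should=[False]))
# should-only query whose clause does not match: the document is excluded.
print(bool_matches(must=[], should=[False]))
```

This is why setting the value explicitly to 0 or 1 in the queries above does not change the hit counts.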

References
- stack overflow - default value of minimum_should_match
- elasticsearch - bool query

Can multiple parameters be set in JSON Input?

Yes

For example, suppose we want to apply the following two conditions,
- Condition 1
  - Setting : double the existing value
  - JSON Input
```
{
  "script" : {
    "source": "_value * 2"
  }
}
```
- Condition 2
  - Setting : replace missing values with 100
  - JSON Input {"missing" : 100}
Entered together,

{
  "script" : {
    "source": "_value * 2"
  },
  "missing" : 100
}

Can the file extension be captured when using the file input?

Yes

Step 1 : default

logstash configuration

input {
  file {
    path => "/usr/share/logstash/data/titanic.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

output {
  stdout {
  }
}

Output

{
  "host" => "65778ce05d75",
  "message" => "49,Samaan,0,3,male,29.69911765,2,0,2662,21.6792,C",
  "@version" => "1",
  "path" => "/usr/share/logstash/data/titanic.csv",
  "@timestamp" => 2018-07-25T16:20:08.338Z
}

Step 2 : add split

logstash configuration

input {
  file {
    path => "/usr/share/logstash/data/titanic.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  mutate {
    split => { "path" => "." }
  }
}


output {
  stdout {
  }
}

Output

{
  "host" => "65778ce05d75",
  "message" => "49,Samaan,0,3,male,29.69911765,2,0,2662,21.6792,C",
  "@version" => "1",
  "path" => [
        [0] "/usr/share/logstash/data/titanic",
        [1] "csv"
   ],    
  "@timestamp" => 2018-07-25T16:20:10.413Z
}

Step 3 : add add_field

logstash configuration

input {
  file {
    path => "/usr/share/logstash/data/titanic.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  mutate {
    split => { "path" => "." }
    add_field => { "filename_extension" => "%{path[-1]}"}
  }
}

output {
  stdout {
  }
}

Output

{
  "@version" => "1",
  "path" => [
        [0] "/usr/share/logstash/data/titanic",
        [1] "csv"
  ],
  "filename_extension" => "csv",
  "host" => "65778ce05d75",
  "message" => "47,Lennon,0,3,male,29.69911765,1,0,370371,15.5,Q",
  "@timestamp" => 2018-07-25T16:27:42.249Z
}

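Can data stored in an index be exported to csv?

Yes

x-pack

Use it as shown in the video
The file size limit can be changed under Management - Advanced Settings - xpack.reporting.csv.maxSizeBytes
However, as the setting's description says, making it too large affects the performance and storage of the elasticsearch cluster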

oss

1. Using logstash

Use the same logstash version as elasticsearch
Go to the logstash home directory
Install the logstash-output-csv plugin : $ bin/logstash-plugin install logstash-output-csv
Write the logstash conf
- Example)

input {
  elasticsearch {
    hosts => "localhost:9200" # change to the host that actually holds the data
    index => "shopping"       # index containing the data to export as csv
    # query DSL selecting which data to extract from the chosen index
    query => '{
      "query" : {
        "match_all" : {
        }
      }
    }'
  }
}

output {
  csv {
    fields => ["주문시간", "물건좌표"]  # fields to include in the csv file
    path => "/home/ec2-user/shopping-logstash.csv" # path where the file is saved
  }
}

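Check the file : available at /home/ec2-user/shopping-logstash.csv
Note : at most 10,000 documents can be exported at once

2. Using a (python) client

Installing the same version as elasticsearch is recommended (not required) : $ pip install elasticsearch==6.2.0
Write code that reads the data and saves it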

import csv
from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=["localhost:9200"])
data = es.search(
         index="shopping",
         size=10000,
         body={
           "query" : {
             "match_all" : {
             }
           }
         }
       )

csv_columns = ['상품가격', '상품분류', '예약여부', '구매사이트', '수령시간',
               '물건좌표', '주문시간', '결제카드', '고객ip', '판매자평점', '배송메모',
               '고객주소_시도', '상품개수', '접수번호', '고객나이', '고객성별'
]
csv_file = '/home/ec2-user/shopping-python.csv'

# newline='' is what the csv module expects; utf-8 keeps the Korean field names intact
with open(csv_file, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
    writer.writeheader()
    # the _source of each hit is one document
    for document in [x['_source'] for x in data['hits']['hits']]:
        writer.writerow(document)

Check the file : available at /home/ec2-user/shopping-python.csv
Note : at most 10,000 documents can be fetched in one request; beyond that, use Scroll
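
For more than 10,000 documents, one possible approach is the `scan` helper in `elasticsearch.helpers`, which wraps the Scroll API. The sketch below is an assumption-laden outline rather than tested production code: `write_hits_to_csv` and `export_index` are hypothetical names, the host is the same localhost default used above, and the sample hit at the bottom is made up so the CSV part can be tried without a cluster.

```python
import csv
import io


def write_hits_to_csv(hits, fieldnames, csvfile):
    """Write the _source of each hit as one CSV row and return the row count."""
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for hit in hits:
        writer.writerow(hit["_source"])
        count += 1
    return count


def export_index(index, fieldnames, path, hosts=("localhost:9200",)):
    # Imported lazily so the CSV helper above can be tried without a cluster.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan

    es = Elasticsearch(hosts=list(hosts))
    # scan() yields every matching hit by scrolling through the index.
    hits = scan(es, index=index, query={"query": {"match_all": {}}})
    with open(path, "w", newline="", encoding="utf-8") as csvfile:
        return write_hits_to_csv(hits, fieldnames, csvfile)


# Offline check with a hand-made hit (the values are made up).
sample_hits = [{"_source": {"주문시간": "2018-01-01T00:00:00",
                            "물건좌표": "37.5,127.0",
                            "extra": 1}}]
buffer = io.StringIO()
written = write_hits_to_csv(sample_hits, ["주문시간", "물건좌표"], buffer)
print(written)
print(buffer.getvalue())
```

Because `scan` returns a generator, rows are written as they arrive instead of being held in memory all at once.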

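3. Using a Kibana Data Table

Build the table in the desired shape with a Data Table Visualization
Next to Export at the bottom, choose either Raw or Formatted
This method can export all Documents at once
However, the load it places on the elasticsearch cluster has to be taken into account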

Can the logstash jdbc input be kept running continuously?

Yes, by using schedule.

First, run it without `schedule`

Code

input {
  jdbc {
    jdbc_validate_connection => true
    jdbc_connection_string => "jdbc:mysql://13.125.21.52:3306/fc"
    jdbc_user => "fc"
    jdbc_password => "fc"
    jdbc_driver_library => "/home/ec2-user/fc/logstash-5.6.4/driver/mysql-connector-java-5.1.36/mysql-connector-java-5.1.36-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    statement => "SELECT * FROM fc WHERE id > :sql_last_value"   
    use_column_value => true
    tracking_column => "id"
  }
}

output {
  stdout {
    codec => rubydebug
  }
}

Result

It queries the DB once and then exits

This time, run it with the `schedule` option (query every second)

Code

input {
  jdbc {
    jdbc_validate_connection => true
    jdbc_connection_string => "jdbc:mysql://13.125.21.52:3306/fc"
    jdbc_user => "fc"
    jdbc_password => "fc"
    jdbc_driver_library => "/home/ec2-user/fc/logstash-5.6.4/driver/mysql-connector-java-5.1.36/mysql-connector-java-5.1.36-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    statement => "SELECT * FROM fc"
    schedule => "* * * * * *"
    use_column_value => true
    tracking_column => "id"
  }
}

output {
  stdout {
    codec => rubydebug
  }
}

Result

It keeps running without stopping, but it reads the same data over and over

Use schedule together with sql_last_value

Code

input {
  jdbc {
    jdbc_validate_connection => true
    jdbc_connection_string => "jdbc:mysql://13.125.21.52:3306/fc"
    jdbc_user => "fc"
    jdbc_password => "fc"
    jdbc_driver_library => "/home/ec2-user/fc/logstash-5.6.4/driver/mysql-connector-java-5.1.36/mysql-connector-java-5.1.36-bin.jar"    
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    statement => "SELECT * FROM fc WHERE id > :sql_last_value"
    use_column_value => true
    tracking_column => "id"
    schedule => "* * * * * *"
  }
}

output {
  stdout {
    codec => rubydebug
  }
}

Result
- When there is no new data, the query still runs but no new rows are read
- When new data is inserted, it is detected and processed right away
- It then waits again until the next new value arrives
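
The tracking logic behind `:sql_last_value` can be illustrated in isolation. The sketch below swaps MySQL for an in-memory SQLite database so it runs anywhere, and calls the hypothetical `poll` function by hand instead of on a schedule; it is not how the jdbc input is implemented.

```python
import sqlite3


def poll(connection, last_value):
    """Fetch rows with id greater than last_value; return (rows, new last_value)."""
    rows = connection.execute(
        "SELECT id, name FROM fc WHERE id > ? ORDER BY id", (last_value,)
    ).fetchall()
    if rows:
        # Remember the highest id seen, like tracking_column => "id".
        last_value = rows[-1][0]
    return rows, last_value


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE fc (id INTEGER PRIMARY KEY, name TEXT)")
connection.executemany("INSERT INTO fc (id, name) VALUES (?, ?)",
                       [(1, "a"), (2, "b")])

last_value = 0
first, last_value = poll(connection, last_value)    # initial rows
second, last_value = poll(connection, last_value)   # nothing new
connection.execute("INSERT INTO fc (id, name) VALUES (3, 'c')")
third, last_value = poll(connection, last_value)    # only the new row
print(first, second, third, last_value)
```

Without the `id > ?` condition every poll would return all rows again, which is the repeated-read behavior seen in the second configuration.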

Can elasticsearch index mappings be created from kibana?

Open kibana in a web browser
- Enter {ip address}:5601 in the address bar
- For example 12.34.567.890:5601
Click Dev tools

When the elasticsearch index already exists
- With the test_index index already in place,
- create the test_type type, and
- create mappings for the message field (text) and the number field (integer)

PUT test_index/test_type/_mapping
{
  "properties": {
    "message": {
      "type": "text"
    },
    "number" : {
      "type": "integer"
    }
  }
}

When the elasticsearch index does not exist yet
- Create the test_type type in the test_index index, and
- create mappings for the message field (text) and the number field (integer)

PUT test_index
{
  "mappings": {
    "test_type": {
      "properties": {
        "message": {
          "type": "text"
        },
        "number" : {
          "type": "integer"
        }
      }
    }
  }
}

Reference

Elasticsearch Reference [5.6] » Indices APIs » Put Mapping

I chose the wrong Time Filter field when registering an index pattern. Can the field be changed?

Yes

1. In .kibana/index-pattern, find the _id of the index whose time filter field you want to change

Example

index : scenario1_higee

Code

GET .kibana/index-pattern/_search
{
  "_source" : "_id",
  "query" : {
    "match" : {
      "title": "scenario1_higee"
    }
  }
}

2. Run the following with the _id found above and the name of the new time filter field

Example

  _id : AWD-qHIXPloSIAlpN7ee
  New time filter field :  주문시간

Code

POST /.kibana/index-pattern/AWD-qHIXPloSIAlpN7ee/_update
{
  "doc": {
    "timeFieldName" : "주문시간"
  } 
}


What is "Format : Url, Type : Image" in the Field Formatter for, and how is it used?

Given the URL where an image is hosted, Kibana can display the image directly

(Sample) Create the mapping

PUT higee_image
{
  "mappings": {
    "test" : {
      "properties": {
        "image" : {
          "type": "keyword"
        },
        "source" : {
          "type": "keyword"
        }
      }
    }
  }
}

(Sample) Data Indexing

POST  higee_image/test
{
  "image" : "dog",
  "source" : "https://www.flaticon.com/free-icon/dog_616408"
}

POST  higee_image/test
{
  "image" : "crab",
  "source" : "https://www.flaticon.com/free-icon/crab_1009977"
}

POST  higee_image/test
{ 
  "image" : "rubber-ring",
  "source" : "https://www.flaticon.com/free-icon/rubber-ring_1087315"
}

POST  higee_image/test
{
  "image" : "teddy-bear",
  "source" : "https://www.flaticon.com/free-icon/teddy-bear_1083850"
}

POST  higee_image/test
{
  "image" : "whale",
  "source" : "https://www.flaticon.com/free-icon/whale_866404"
}

Check in Discover

Change the Field Format : the images were uploaded to AWS S3 beforehand

Check again in Discover

Source

Elastic - Field Formatters in Kibana 4.1

Can a NULL value in a specific column be detected when using the jdbc input?

Yes

First, check the data

mysql> select * from test;
>>>
+----+-------+
| id | name  |
+----+-------+
|  1 | hello |
|  2 | NULL  |
+----+-------+

Scenario 1 : drop the event if the `name` field is `NULL`

logstash configuration

$ cat test.conf
>>>
input {
  jdbc {
    jdbc_validate_connection => true
    jdbc_connection_string => "jdbc:mysql://52.78.134.20:3306/fc"
    jdbc_user => "fc"
    jdbc_password => "fc"
    jdbc_driver_library => "/usr/share/logstash/driver/mysql-connector-java-5.1.36-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    statement => "SELECT * FROM test"
  }
}

filter {
  if ![name] {
    drop {}
  }
}

output {
  stdout {
  }
}

Result

{
    "name" => "hello",
    "id" => 1,
    "@timestamp" => 2018-07-21T14:31:34.624Z,
    "@version" => "1"
}

Scenario 2 : replace the `name` field with a specific value if it is `NULL`

logstash configuration

$ cat test.conf
>>>
input {
  jdbc {
    jdbc_validate_connection => true
    jdbc_connection_string => "jdbc:mysql://52.78.134.20:3306/fc"
    jdbc_user => "fc"
    jdbc_password => "fc"
    jdbc_driver_library => "/usr/share/logstash/driver/mysql-connector-java-5.1.36-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    statement => "SELECT * FROM test"
  }
}

filter {
  ruby {
    code => "event.set('name', 'this is null') if event.get('name').nil?"
  }
}

output {
  stdout {
  }
}

Result

{
  "id" => 1,
  "name" => "hello",
  "@version" => "1",
  "@timestamp" => 2018-07-21T14:36:31.711Z
}
{
   "id" => 2,
  "name" => "this is null",
  "@version" => "1",
  "@timestamp" => 2018-07-21T14:36:31.757Z
}

Scenario 3 : check every field for `NULL` and remove only the fields that are `NULL`

logstash configuration

$ cat test.conf
>>>
input {
  jdbc {
    jdbc_validate_connection => true
    jdbc_connection_string => "jdbc:mysql://52.78.134.20:3306/fc"
    jdbc_user => "fc"
    jdbc_password => "fc"
    jdbc_driver_library => "/usr/share/logstash/driver/mysql-connector-java-5.1.36-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    statement => "SELECT * FROM test"
  }
}

filter {
  ruby {
    code => "
      hash = event.to_hash
      hash.each do |k,v|
        if v == nil
          event.remove(k)
        end
      end
    "
  }
}

output {
  stdout {
  }
}

Result

{
  "@timestamp" => 2018-07-21T14:39:02.454Z,
  "@version" => "1",
  "id" => 1,
  "name" => "hello"
}
{
  "@timestamp" => 2018-07-21T14:39:02.505Z,
  "@version" => "1",
  "id" => 2
}
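
The three scenarios amount to simple operations on each event. The sketch below restates them in Python on plain dictionaries, using the two rows from the test table; the function names are hypothetical. One difference to keep in mind: the logstash conditional `if ![name]` is also true for a missing or false field, while this sketch only checks for `None`.

```python
rows = [{"id": 1, "name": "hello"}, {"id": 2, "name": None}]


def drop_if_null(events, field):
    """Scenario 1: discard events whose field is null."""
    return [e for e in events if e.get(field) is not None]


def replace_null(events, field, default):
    """Scenario 2: substitute a default for a null field."""
    return [{**e, field: default} if e.get(field) is None else dict(e)
            for e in events]


def remove_null_fields(events):
    """Scenario 3: strip every null field from every event."""
    return [{k: v for k, v in e.items() if v is not None} for e in events]


print(drop_if_null(rows, "name"))
print(replace_null(rows, "name", "this is null"))
print(remove_null_fields(rows))
```

Each function returns new dictionaries, so the original rows can be reused across the three scenarios.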

Source

Logstash Reference - Event API

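Can Excel files (xls, xlsx) be loaded easily as well?

Within the Elastic Stack, logstash is the component that ships files, and as of the current official version (6.1) it has no excel support. csv files are shipped with the file input plugin.

If excel data really has to be used, the following options are available.

Insert directly into elasticsearch with an Elasticsearch client (Python, Java, etc.)
Convert the excel file to csv format yourself and use logstash
Use a community plugin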
- github
- demo
  - elasticsearch 5+
  - elasticsearch 6+

Is there a limit on input message size?

The limit appears to differ from one input plugin to another

Example 1. file input plugin

Data (message.txt)

$ wc -lm message.txt
>>>
1 276481 message.txt

logstash configuration

input {
  file {
    path => "/usr/share/logstash/message.txt"
    start_position => "beginning"
  }
}

output {
  stdout {
  }
}

Result : even a single message of about 270,000 characters, as above, is processed normally

Example 2. udp input plugin

logstash configuration : limited by buffer size

input {
  udp {
    port => 5000
    buffer_size => 100
  }
}

output {
  stdout {
  }
}

Send logs

$ logger -n 127.0.0.1 -P 5000 "Test Message : hello world"
$ logger -n 127.0.0.1 -P 5000 "Test Message : hello world. Welcome to higee/elastic repository"
$ logger -n 127.0.0.1 -P 5000 "Test Message : hello world. Welcome to higee/elastic repository. In this course, we get to learn the basics of elastic stack"

Result : for a message that exceeds the buffer size, everything beyond the buffer size is cut off

2018-07-22T02:53:33.719Z 127.0.0.1 <5>Jul 22 11:53:33 ec2-user: Test Message : hello world
2018-07-22T02:53:57.301Z 127.0.0.1 <5>Jul 22 11:53:57 ec2-user: Test Message : hello world. Welcome to higee/elastic repository
2018-07-22T02:54:13.635Z 127.0.0.1 <5>Jul 22 11:54:13 ec2-user: Test Message : hello world. Welcome to higee/elastic repository. In thi
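
The truncation seen above is consistent with how a UDP receiver behaves when its read buffer is smaller than the datagram. The sketch below reproduces that effect with plain sockets on the loopback interface. It is not the Logstash implementation, `receive_with_buffer` is a hypothetical helper, and the behaviour described is that of Linux and similar systems (other platforms may raise an error instead of truncating).

```python
import socket


def receive_with_buffer(payload: bytes, buffer_size: int) -> bytes:
    """Send one datagram over loopback and read it with a fixed-size buffer."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.settimeout(2.0)
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        sender.sendto(payload, receiver.getsockname())
        # Bytes beyond buffer_size are discarded by the OS on Linux.
        data, _ = receiver.recvfrom(buffer_size)
        return data
    finally:
        sender.close()
        receiver.close()


short = receive_with_buffer(b"Test Message : hello world", 100)
long = receive_with_buffer(b"x" * 150, 100)
print(len(short), len(long))
```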

Source

Indexing a value whose datatype differs from the mapping datatype

Suppose there is an index with the following mapping.

{
  "tt": {
    "mappings": {
      "tt": {
        "properties": {
          "번호": {
            "type": "integer"
          },
          "시간": {
            "type": "date"
          },
          "이름": {
            "type": "keyword"
          },
          "이메일": {
            "type": "text"
          }
        }
      }
    }
  }
}

Then index some data and query all of it.

GET tt/_search
{
  "query" : { 
    "match_all" : {}
  }
}

Result

Look closely: the string "1" is stored in 번호, which is mapped as integer, and the numeric value 5 is stored in 이름, which is mapped as keyword.

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "tt",
        "_type": "tt",
        "_id": "2",
        "_score": 1,
        "_source": {
          "번호": 1,
          "이름": "higee"
        }
      },
      {
        "_index": "tt",
        "_type": "tt",
        "_id": "1",
        "_score": 1,
        "_source": {
          "번호": 1
        }
      },
      {
        "_index": "tt",
        "_type": "tt",
        "_id": "3",
        "_score": 1,
        "_source": {
          "번호": "1",
          "이름": 5
        }
      }
    ]
  }
}

Let's check whether aggregations work too.

이름 Field

Query

GET tt/_search
{ 
  "size": 0, 
  "query": {
    "match_all": {}
  },
  "aggs": {
    "test_agg": {
      "sum": {
        "field": "이름"
      }
    }
  }
}

Result

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Expected numeric type on field [이름], but got [keyword]"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "tt",
        "node": "eJL5WBAkTlCts8NhtI_dmA",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Expected numeric type on field [이름], but got [keyword]"
        }
      }
    ]
  },
  "status": 400
}

번호 Field

Query

GET tt/_search
{ 
  "size": 0, 
  "query": {
    "match_all": {}
  },
  "aggs": {
    "test_agg": {
      "sum": {
        "field": "번호"
      }
    }
  }
}

Result

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "test_agg": {
      "value": 3
    }
  }
}

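So why does this happen?

Simply because Elasticsearch allows it. For numeric fields, coercion is enabled by default, so a string such as "1" is converted to a number when it is indexed, while `_source` keeps the original document unchanged. To block this, disable coercion as follows.

Per field

Query: with the following setting, the number_two field accepts only integer values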

PUT my_index
{
 "mappings": {
   "my_type": {
     "properties": {
       "number_one": {
         "type": "integer"
       },
       "number_two": {
         "type": "integer",
         "coerce": false
       }
     }
   }
 }
}

Result: a value such as the following is rejected.

 PUT my_index/my_type/5
 {
   "number_two" : 1.3
 }

Whole index

Query

PUT my_index
{
  "settings": {
    "index.mapping.coerce": false
  },
  "mappings": {
    "my_type": {
      "properties": {
        "number_one": {
          "type": "integer"
        },
        "number_two": {
          "type": "integer"
        }
      }
    }
  } 
}

Result: coercion is disabled for every field in the index, so only values that match the mapped data type are accepted

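Can a cumulative average be visualized?

Yes.

However, it cannot be done in a single step at the aggregation level.
Suppose we have the following data.
(Given how the shopping index is structured, assume that revenue is the sum of "상품가격".)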

Date	Revenue	Count
2018.01.01	423,000	23
2018.01.02	477,000	26
2018.01.03	394,000	22
2018.01.04	465,000	27
2018.01.05	352,000	21
2018.01.06	361,000	24

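How do we get the cumulative average of revenue from this?
Once we think about what a cumulative average means, it can be implemented.
Let's define it as the running total of revenue divided by the running total of count, computed day by day.

Date	Revenue	Count	Cumulative revenue	Cumulative count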
2018.01.01	423,000	23	423,000	23
2018.01.02	477,000	26	900,000	49
2018.01.03	394,000	22	1,294,000	71
2018.01.04	465,000	27	1,759,000	98
2018.01.05	352,000	21	2,111,000	119
2018.01.06	361,000	24	2,472,000	143

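Using the table above, the daily cumulative average can be computed as follows (decimals truncated).

Date	Revenue	Count	Cumulative revenue	Cumulative count	Cumulative average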
2018.01.01	423,000	23	423,000	23	18391
2018.01.02	477,000	26	900,000	49	18367
2018.01.03	394,000	22	1,294,000	71	18225
2018.01.04	465,000	27	1,759,000	98	17948
2018.01.05	352,000	21	2,111,000	119	17739
2018.01.06	361,000	24	2,472,000	143	17286
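
The arithmetic behind the table can be checked with a few lines of Python. This is a minimal sketch that uses the figures from the table; the variable names are chosen for illustration, and integer division is used because the table truncates decimals.

```python
from itertools import accumulate

# Daily figures from the table (2018.01.01 to 2018.01.06).
revenue = [423_000, 477_000, 394_000, 465_000, 352_000, 361_000]
count = [23, 26, 22, 27, 21, 24]

# Running totals, then the ratio of the two running totals per day.
cumulative_revenue = list(accumulate(revenue))
cumulative_count = list(accumulate(count))
cumulative_average = [
    total // n for total, n in zip(cumulative_revenue, cumulative_count)
]

for day, avg in enumerate(cumulative_average, start=1):
    print(f"2018.01.{day:02d}", avg)
```

Note that this is the ratio of two cumulative sums, not the average of the daily averages, which is why it has to be built from two series.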

So how can these values be visualized in Kibana?
The easiest way is to use Timelion.

.es(index=shopping, timefield=주문시간, metric=sum:상품가격).cusum().divide(.es(index=shopping, timefield=주문시간).cusum()).label(누적평균)

Plotted together with other series, it looks like this.

.es(index=shopping, timefield=주문시간, metric=sum:상품가격).cusum().label("매출 누적합").yaxis(yaxis=2, label='매출 누적합', min=0), 
.es(index=shopping, timefield=주문시간, metric=sum:상품가격).cusum().divide(.es(index=shopping, timefield=주문시간).cusum()).label("누적 평균").yaxis(yaxis=1, min=0, max=30000, label='누적 평균 및 누적 개수').points(), 
.es(index=shopping, timefield=주문시간).cusum().label("누적 개수").bars()

Can messages larger than a given size be caught?

Yes.

Method 1. ruby filter

logstash configuration

Role: if the input message is longer than 10 characters, keep only the first 6 characters (indices 0 through 5)
Code

input {
  stdin {
  }
}

filter {
  ruby {
    code => "
      if event.get('message').length > 10
        event.set('new message', event.get('message')[0..5])
        event.remove('message')
        event.tag('ruby filter activated')
      end
    "
  }
}

output {
  stdout {
  }
}

Test

$ hi
>>>
{
  "message" => "hi",
  "@timestamp" => 2018-07-22T05:12:15.331Z,
  "host" => "69a61635b4d0",
  "@version" => "1"
}

$ hello world
>>>
{
  "@timestamp" => 2018-07-22T05:12:18.906Z,
  "host" => "69a61635b4d0",
  "tags" => [
    [0] "ruby filter activated"
  ],
  "@version" => "1",
  "new message" => "hello "
}

Method 2. truncate filter

logstash configuration

Role: if the message is larger than 10 bytes, drop everything beyond the first 10 bytes
Code

input {
  stdin {
  }
}

filter {
  truncate {
    fields => ["message"]
    length_bytes => 10
  }
}

output {
  stdout {
  }
}

Test

$ hi
>>>
{
  "host" => "69a61635b4d0",
  "@version" => "1",
  "@timestamp" => 2018-07-22T05:20:46.970Z,
  "message" => "hi"
}

$ hello world
>>>
{
  "host" => "69a61635b4d0",
  "@version" => "1",
  "@timestamp" => 2018-07-22T05:21:00.556Z,
  "message" => "hello worl"
}
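
Method 1 measures the message in characters (Ruby's `String#length`), while Method 2 measures it in bytes (`length_bytes`). For ASCII input such as "hello world" the two agree, but for multi-byte text they do not. The sketch below shows the difference in Python. The two helper functions are hypothetical, and dropping a partially cut character is a choice made here for illustration, not a statement about how the truncate filter handles that case.

```python
def truncate_chars(text: str, limit: int) -> str:
    """Keep at most `limit` characters."""
    return text[:limit]


def truncate_bytes(text: str, limit: int) -> str:
    """Keep at most `limit` UTF-8 bytes, dropping a partially cut character."""
    raw = text.encode("utf-8")[:limit]
    return raw.decode("utf-8", errors="ignore")


ascii_message = "hello world"
korean_message = "안녕하세요 세계"  # 8 characters, 22 bytes in UTF-8

print(truncate_chars(ascii_message, 10), "|", truncate_bytes(ascii_message, 10))
print(truncate_chars(korean_message, 10), "|", truncate_bytes(korean_message, 10))
```

When messages may contain Korean text, a byte limit allows far fewer characters than the same number used as a character limit.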

Method 3. range filter

logstash configuration

Role
- message length < 10: add the short tag
- 10 ≤ message length < 20: drop
Code

input {
  stdin {
  }
}

filter {
  range {
    ranges => [
      "message", 0, 10, "tag:short",
      "message", 10, 20, "drop"
    ]
  }
}

output {
  stdout {
  }
}

Test

$ hi
>>>
{
  "host" => "69a61635b4d0",
  "tags" => [
    [0] "short"
  ],
  "message" => "hi",
  "@timestamp" => 2018-07-22T05:48:21.098Z,
  "@version" => "1"
}

$ hello world
>>>
(no output: the 11-character message falls in the 10 to 20 range, so the event is dropped)

Source

If a mapping is set through a template, is it applied on reindex?

Yes, it is applied.

Mapping of the original index to be reindexed

PUT test_reindex/_mapping/test_reindex
{
  "properties": {
    "번호": {
      "type": "integer"
    },
    "시간": {
      "type": "date"
    },
    "이름": {
      "type": "keyword"
    },
    "이메일": {
      "type": "text"
    }
  }
}

Insert data into the original index

PUT test_reindex/test_reindex/1
{
  "번호" : 1,
  "시간" : "2017-12-01",
  "이름": "higee",
  "이메일" : "[email protected]"
}

Create the mapping of the index to be used as the reindex destination through a template

PUT _template/test_template
{
  "template": "test_reindex_destination",
  "mappings": {
    "my_type": {
      "properties": {
        "번호": {
          "type": "integer"
        },
        "시간": {
          "type": "date"
        },
        "이름": {
          "type": "keyword"
        },
        "이메일": {
          "type": "text"
        }
      }
    }
  }
}

Create the destination index

PUT test_reindex_destination

Check the destination index mapping

Command: GET test_reindex_destination/_mapping
Result

{
  "test_reindex_destination": {
    "mappings": {
      "my_type": {
        "properties": {
          "번호": {
            "type": "integer"
          },
          "시간": {
            "type": "date"
          },
          "이름": {
            "type": "keyword"
          },
          "이메일": {
            "type": "text"
          }
        }
      }
    }
  }
}

Run the reindex

POST _reindex
{
  "source": {
    "index": "test_reindex"
  },
  "dest": {
    "index": "test_reindex_destination"
  }
}

Check that the data arrived correctly in the destination index

GET test_reindex_destination/_search
{
  "query": {
    "match_all": {}
  }
}

Can search results be visualized and shared?

Yes.

1. Go to the Discover page

2. Search for the content you want to save

Index : shopping
Time Range : Last 90 days
Query : "내에 시간 배송 못함"~2

3. Save the search

4. Use the saved search in Visualize

5. Create the visualization

6. Save the visualization

7. Share the visualization

8. Check how the search results are reflected

Can Logstash auto-detect the CSV header?

Yes.

Suppose we have CSV data like the following (file name: `titanic.csv`).

Index, Name, Survival, Pclass, Sex, Age, SibSp, Parch, Ticket, Fare, Embarked
1,Braund,0,3,male,22,1,0,A/5 21171,7.25,S
2,Cumings,1,1,female,38,1,0,PC 17599,71.2833,C
3,Heikkinen,1,3,female,26,0,0,STON/O2. 3101282,7.925,S

To use the header as the field names, write the conf file (file name: `csv.conf`) as follows.

input {
  file  {
    path => "/home/ec2-user/fc/logstash-5.6.4/sample/titanic.csv"
  }
}

filter {
  csv {
    separator => ","
    autodetect_column_names => true
  }
}

output {
  stdout {
    codec => rubydebug
  }
}
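
What `autodetect_column_names` is meant to achieve, treating the first row as the field names for every following row, can be pictured with Python's standard `csv` module. This is only an analogy under stated assumptions, not Logstash code: the sample text repeats the data above, and `skipinitialspace=True` is a choice made here because the header row has a space after each comma.

```python
import csv
import io

# Same content as titanic.csv above, inlined for a self-contained example.
sample = """Index, Name, Survival, Pclass, Sex, Age, SibSp, Parch, Ticket, Fare, Embarked
1,Braund,0,3,male,22,1,0,A/5 21171,7.25,S
2,Cumings,1,1,female,38,1,0,PC 17599,71.2833,C
3,Heikkinen,1,3,female,26,0,0,STON/O2. 3101282,7.925,S
"""

# DictReader takes the first row as the column names.
reader = csv.DictReader(io.StringIO(sample), skipinitialspace=True)
events = [dict(row) for row in reader]

for event in events:
    print(event["Name"], event["Fare"])
```

All values arrive as strings, so numeric columns would still need an explicit conversion step.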

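Let's run it.

$ bin/logstash -f csv.conf

The result is as follows.

Can a template's mapping be modified?

Yes, as long as it is done before indexing. A template is applied only when an index is created, so changing it does not affect an index that already exists.

~~Scenario 1 (x)~~
Scenario 2 (o)

Scenario 1

Create the template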

PUT _template/test
{
  "index_patterns": "test*",
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "keyword"
        }
      }
    }
  }
}

Index a document

POST test1/test
{
  "name" : "higee"
}

Modify the template

PUT _template/test
{
  "index_patterns": "test*",
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "text"
        }
      }
    }
  }
}

Index again

POST test1/test
{
  "name" : "higee/elastic"
}

Check the mapping

GET test1/_mapping
>>>
{
  "test1": {
    "mappings": {
      "test": {
        "properties": {
          "name": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Scenario 2

Create the template

PUT _template/test
{
  "index_patterns": "test*",
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "keyword"
        }
      }
    }
  }
}

Modify the template

PUT _template/test
{
  "index_patterns": "test*",
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "text"
        }
      }
    }
  }
}

Index a document

POST test2/test
{
  "name" : "higee"
}

Check the mapping

GET test2/_mapping
>>>
{
  "test2": {
    "mappings": {
      "test": {
        "properties": {
          "name": {
            "type": "text"
          }
        }
      }
    }
  }
}

Check the template

GET /_template/test
>>>
{
  "test": {
    "order": 0,
    "index_patterns": [
      "test*"
    ],
    "settings": {},
    "mappings": {
      "test": {
        "properties": {
          "name": {
            "type": "text"
          }
        }
      }
    },
    "aliases": {}
  }
}

Can conditions such as AND/OR be used in a scripted field?

Yes, through Boolean operators.

Boolean Operator
- AND
  - Syntax: &&
  - Meaning: all conditions are satisfied
  - Example
    - If 상품개수 is less than 3 and 상품가격 is less than 15000, return 저소비 (low spending); otherwise return 과소비 (overspending)
    - if (doc['상품개수'].value < 3 && doc['상품가격'].value < 15000){ return "저소비" } return "과소비"
- OR
  - Syntax: ||
  - Meaning: at least one condition is satisfied
  - Example
    - If 상품개수 is less than 3 or 상품가격 is less than 15000, return 저소비; otherwise return 과소비
    - if (doc['상품개수'].value < 3 || doc['상품가격'].value < 15000){ return "저소비" } return "과소비"
- NOT
  - Syntax: !
  - Meaning: the condition is not satisfied
  - Example
    - If 상품개수 is less than 3 and 상품가격 is 15000 or more, return 저소비; otherwise return 과소비
    - if (doc['상품개수'].value < 3 && !(doc['상품가격'].value < 15000)){ return "저소비" } return "과소비"
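
To see how the three scripts classify concrete values, the same conditions can be mirrored in Python. This is only a sketch for checking the logic outside Kibana: the function names are hypothetical, and `count` and `price` stand in for `doc['상품개수'].value` and `doc['상품가격'].value`.

```python
LOW, HIGH = "저소비", "과소비"


def and_rule(count: int, price: int) -> str:
    # Mirrors: count < 3 && price < 15000
    return LOW if count < 3 and price < 15000 else HIGH


def or_rule(count: int, price: int) -> str:
    # Mirrors: count < 3 || price < 15000
    return LOW if count < 3 or price < 15000 else HIGH


def not_rule(count: int, price: int) -> str:
    # Mirrors: count < 3 && !(price < 15000)
    return LOW if count < 3 and not (price < 15000) else HIGH


# A few sample (count, price) pairs run through all three rules.
for count, price in [(2, 10000), (2, 20000), (5, 10000), (5, 20000)]:
    print(count, price, and_rule(count, price), or_rule(count, price), not_rule(count, price))
```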

Can the Time Picker's time range be specified as a Unix (ms) timestamp?

No.

The Time Picker's time range can be set through Quick, Relative, or Absolute.

Quick

Relative

Absolute: only the YYYY-MM-DD HH:mm:ss.SSS format is recognized

However, the display format of date fields can be changed to a Unix timestamp.
Management - Advanced Settings - edit dateFormat: X or x

Before : Default: MMMM Do YYYY, HH:mm:ss.SSS

After : X
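
Since Absolute accepts only the YYYY-MM-DD HH:mm:ss.SSS format, one workaround is to convert a Unix millisecond timestamp to that format before entering it. Below is a minimal sketch; `to_absolute_format` is a hypothetical helper, and UTC is assumed as the default, so pass the timezone that your Kibana is actually displaying if it differs.

```python
from datetime import datetime, timezone


def to_absolute_format(epoch_ms: int, tz: timezone = timezone.utc) -> str:
    """Format a Unix millisecond timestamp as YYYY-MM-DD HH:mm:ss.SSS."""
    # Split with integer arithmetic to avoid float rounding of milliseconds.
    seconds, millis = divmod(epoch_ms, 1000)
    moment = datetime.fromtimestamp(seconds, tz=tz)
    return f"{moment:%Y-%m-%d %H:%M:%S}.{millis:03d}"


print(to_absolute_format(1532236335331))
```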

Which fields can be used for sorting in Discover?

Sorting in Discover here refers to whether a field can be used as shown below.

Let's look at each data type.

Category	Type	Sortable
string	text	✕
string	keyword	◎
numeric	long	◎
numeric	integer	◎
numeric	short	◎
numeric	byte	◎
numeric	double	◎
numeric	float	◎
numeric	half-float	◎
numeric	scaled-float	◎
date	date	◎
boolean	boolean	◎
geo_point	geo_point	✕
array		✕
nested		✕
object		◎

So can text, geo_point, array, nested, and so on not be sorted in Elasticsearch itself?

text: no → use multi-fields if needed
geo_point: to compare by geo_distance, a reference point must be supplied in addition to the indexed geo_point

GET test/_search
{
  "sort": [
    {
      "_geo_distance": {
        "geo_point": "37.11, 128.12",
        "order": "asc",
        "unit": "km"
      }
    }
  ]
}

array: an array field holds one or more values, so you must specify how to aggregate them for sorting

GET test/_search
{
  "sort": {
    "array.가격": {
      "order": "asc",
      "mode": "avg"
    }
  }
}
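
The effect of the sort `mode` on multi-valued fields can be illustrated in Python: each document's list of values is reduced to a single sort key, and the documents are then ordered by that key. This is a sketch under assumptions, not Elasticsearch code: the three documents and their values are made up, and only the mode names (`min`, `max`, `sum`, `avg`, `median`) come from Elasticsearch.

```python
from statistics import mean, median

# Sort modes and the reduction each one applies to a document's values.
SORT_MODES = {"min": min, "max": max, "sum": sum, "avg": mean, "median": median}


def sort_key(values, mode):
    """Reduce a multi-valued field to one sort value."""
    return SORT_MODES[mode](values)


# Hypothetical documents: id -> values of the multi-valued price field.
docs = {"doc1": [100, 200], "doc2": [50, 400], "doc3": [120]}

by_avg = sorted(docs, key=lambda doc_id: sort_key(docs[doc_id], "avg"))
by_min = sorted(docs, key=lambda doc_id: sort_key(docs[doc_id], "min"))
print(by_avg, by_min)
```

The same documents come out in a different order depending on the mode, which is why the mode has to be stated explicitly.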

nested: as with array, you must specify the aggregation mode and which field is nested

GET test/_search
{
  "sort": [
    {
      "nested.가격": {
        "mode": "avg",
        "order": "asc",
        "nested": {
          "path": "nested"
        }
      }
    }
  ]
}

sample mapping

PUT test
{
  "mappings": {
    "test": {
      "properties": {
        "date" : {
          "type" : "date"
        },
        "geo_point" : {
          "type": "geo_point"
        },
        "text": {
          "type": "text"
        },
        "long": {
          "type": "long"
        },
        "integer": {
          "type": "integer"
        },
        "short": {
          "type": "short"
        },
        "byte": {
          "type": "byte"
        },
        "double": {
          "type": "double"
        },
        "float": {
          "type": "float"
        },
        "half-float": {
          "type": "half_float"
        },
        "scaled-float": {
          "type": "scaled_float",
          "scaling_factor": 100
        },
        "boolean": {
          "type": "boolean"
        },
        "array": {
          "properties": {
            "가격": {
              "type": "integer"
            }
          }
        },
        "nested": {
          "type": "nested",
          "properties": {
            "가격": {
              "type": "integer"
            }
          }
        },
        "object": {
          "properties": {
            "가격": {
              "type": "integer"
            }
          }
        }
      }
    }
  }
}

sample data

PUT test/test/1
{
  "date" : "2018-08-18T16:40:33",
  "text": "hello world",
  "long": 1,
  "geo_point" : "37.11, 128.11",
  "integer": 1,
  "short": 1,
  "byte": 1,
  "double": 0.1,
  "float": 0.1,
  "half-float": 0.1,
  "scaled-float": 0.1,
  "boolean": true,
  "object": {
    "가격": 1
  },
  "nested": [
    {
      "가격": 100
    },
    {
     "가격": 200
    }
  ],
  "array" : [
    {
      "가격": 100
    },
    {
     "가격": 200
    }
  ]
}
