loglang's Introduction

Hi there, I'm Nic! 👋


loglang's Issues

Input: Lumberjack

Lumberjack (elastic/go-lumber) is a protocol developed by Elastic.

It is the preferred protocol for moving events between Logstash instances.

Apparently, Elastic Beats also use Lumberjack v1 or v2.

Reference:

Spec (v1; 2016) https://github.com/elastic/logstash-forwarder/blob/master/PROTOCOL.md

Protocol Characteristics

  • Binary protocol (ASCII frame-type codes with big-endian binary length fields)
  • Transport: TCP/5044
  • Framing: binary header for each field (must decode all fields to get a frame)
  • Codec: syslog
  • Confidentiality: TLS
  • Integrity: supports bulk acknowledgements (unlike RELP), negotiated window size
  • Compression: per-frame, zlib
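As a rough illustration of the framing, here is a minimal Go sketch of reading one v1 frame header per the PROTOCOL.md spec linked above (readWindowSize is an invented name; data, compressed, and ack frames are omitted):

package lumberjack

import (
	"encoding/binary"
	"fmt"
	"io"
)

// Every v1 frame starts with a version byte ('1') and a type byte:
// 'W' = window size, 'D' = data, 'C' = compressed, 'A' = ack.
func readWindowSize(r io.Reader) (uint32, error) {
	var hdr [2]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return 0, err
	}
	if hdr[0] != '1' {
		return 0, fmt.Errorf("unsupported protocol version %q", hdr[0])
	}
	if hdr[1] != 'W' {
		return 0, fmt.Errorf("expected window-size frame, got %q", hdr[1])
	}
	var size uint32
	if err := binary.Read(r, binary.BigEndian, &size); err != nil {
		return 0, err
	}
	return size, nil // sender expects an ack after this many data frames
}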

Get a performance baseline

What kind of I/O?

  • unix socket
  • TCP socket
  • UDP socket
  • stdin/stdout
  • HTTP (use k6)

Performance measurements:

  • time to first byte at destination
  • time to acknowledgement
  • time until all events have been processed
  • inter-frame latency (time between frames from server)
  • latency from send to acknowledgement

Capacity:

  • what load will start causing substantial delay?
  • what load will crash the system?

Test with E2E acknowledgement both enabled and disabled.

It may be possible to expose some of these performance numbers through the /metrics endpoint.
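As a starting point, latency from send to acknowledgement can be measured with a few lines of Go (the address and the newline-delimited ack format are assumptions):

package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
	"time"
)

func main() {
	// Assumed: a server on localhost:5044 that acks each event with one line.
	conn, err := net.Dial("tcp", "localhost:5044")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	start := time.Now()
	if _, err := conn.Write([]byte("test event\n")); err != nil {
		log.Fatal(err)
	}
	if _, err := bufio.NewReader(conn).ReadString('\n'); err != nil {
		log.Fatal(err)
	}
	fmt.Println("send-to-ack latency:", time.Since(start))
}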

Output batches

How should loglang decide how many events to write in a single batch to a file, to an S3 object, to Elasticsearch, in a single HTTP POST request, and so on?
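One common approach, sketched below, is to flush on whichever limit is hit first: event count, byte size, or age of the oldest buffered event. The BatchPolicy type is hypothetical:

package output

import "time"

// BatchPolicy flushes when any one limit is reached, whichever comes first.
type BatchPolicy struct {
	MaxEvents int           // e.g. 500 events per Elasticsearch bulk request
	MaxBytes  int           // e.g. 5 MB per S3 object
	MaxDelay  time.Duration // e.g. flush at least every 5 seconds
}

func (p BatchPolicy) ShouldFlush(events, bytes int, oldest time.Time) bool {
	return events >= p.MaxEvents ||
		bytes >= p.MaxBytes ||
		time.Since(oldest) >= p.MaxDelay
}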

Output: http_request

  • sends an http request
  • should support batching strategy
  • customizable user agent
  • customizable http method (but default to POST)
  • options to enable sending compressed data
  • ideally should support HTTP_PROXY environment variable
  • Authorization Bearer tokens
  • other custom headers (possibly using a callback?)
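A minimal Go sketch covering most of the list above (function and parameter names are invented; the default http.Transport already honors HTTP_PROXY via http.ProxyFromEnvironment):

package output

import (
	"bytes"
	"compress/gzip"
	"net/http"
)

func sendBatch(url, method, userAgent, token string, body []byte, compress bool) error {
	buf := &bytes.Buffer{}
	if compress {
		zw := gzip.NewWriter(buf) // option to send compressed data
		if _, err := zw.Write(body); err != nil {
			return err
		}
		if err := zw.Close(); err != nil {
			return err
		}
	} else {
		buf.Write(body)
	}

	req, err := http.NewRequest(method, url, buf) // method defaults to POST upstream
	if err != nil {
		return err
	}
	req.Header.Set("User-Agent", userAgent) // customizable user agent
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	if compress {
		req.Header.Set("Content-Encoding", "gzip")
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}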

See Also

AWS inputs/outputs

It would be great to be interoperable with these AWS systems:

  • S3 (either paired with SQS, or not)
  • SQS
  • Kinesis
  • MSK (Kafka)
  • CloudWatch Alarms
  • CloudWatch Logs
  • DynamoDB
  • DynamoDB streams
  • AmazonMQ
  • ElastiCache (the Redis plugin should be good enough?)
  • SNS
  • SES
  • Lambda
  • EventBridge
  • Step Functions

But this probably deserves a separate repo entirely.

Filter: lookback

Compare the current event with a previous matching event

Identify matching event by match on a field. If matching on multiple fields is desired, use the fingerprint filter first.

Copy previous event into @previous

Could this be used to calculate a "high water mark" and send alerts when it is exceeded?

โš ๏ธ This may require substantial storage.

Input: http_request

  • should be able to run continuously (http long polling)
  • should be able to run on a schedule or on-demand
  • customize the HTTP method (e.g. GET, POST)
  • customizable user-agent?
  • full support for variable codecs and chained framing
  • send Accept header to enable compression
  • optionally use Content-Type hinting to automatically select framing and codec
  • make proper use of Etag to respect internet caches
  • ideally should support HTTP_PROXY environment variable
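For the Etag point, a sketch of conditional polling with If-None-Match (the Poller type is hypothetical):

package input

import "net/http"

// Poller remembers the ETag from the last response and sends it back as
// If-None-Match; a 304 means nothing changed and no events are emitted.
type Poller struct {
	URL  string
	etag string
}

func (p *Poller) Fetch() (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, p.URL, nil)
	if err != nil {
		return nil, err
	}
	if p.etag != "" {
		req.Header.Set("If-None-Match", p.etag)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode == http.StatusNotModified {
		resp.Body.Close()
		return nil, nil // cached: no new content
	}
	p.etag = resp.Header.Get("ETag")
	return resp, nil
}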

See Also

Use Cases

GitHub API: https://docs.github.com/en/rest/overview/authenticating-to-the-rest-api?apiVersion=2022-11-28

Output: tcp_listener

It would be really cool to "tune in" and get an immediate live firehose of all events flowing out of the pipeline.

Doing this with a WebSocket would also be cool.

Outputs should also have a filter chain

There's the primary filter chain, and a filter chain for each input.

Each output should also have a filter chain so it's possible to customize for that output.

  • Output to Elasticsearch could use filter rules to ensure field types are consistent
  • Output to Slack could use filter rules to populate fields used by the Slack output plugin
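One possible shape for this, sketched with hypothetical types:

package loglang

type Event map[string]any

type Filter func(Event) Event

// Output runs its own filter chain after the primary chain has run.
type Output struct {
	Send    func(Event) error
	Filters []Filter // e.g. coerce field types for Elasticsearch
}

func (o Output) Write(e Event) error {
	for _, f := range o.Filters {
		e = f(e)
	}
	return o.Send(e)
}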

Output: websocket

It would be really cool to support WebSockets so that a browser can "tune in" to a realtime firehose of events. The browser should be able to provide a filter that is executed on the server.

Output: http_listen

Listen for HTTP requests, and reply with recent events.

Two modes:

  • buffer always replies with recent events. This is not reliable delivery, but it can be useful for peeking at recent events. The size of the buffer (number of events, number of bytes) is configured as part of the output (see the ring-buffer sketch below).
  • cursor allows the client to fetch new events sent after the cursor position. Server replies with an updated cursor position. Due to limited buffer space, events may be dropped. This imperfection is acceptable.

Should probably allow the client to provide a filter. If filtering, then more modes are useful:

  • buffer-per-client (so that rare events don't get overwhelmed by common ones)
  • shared-buffer (when we don't trust the client)
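For buffer mode, a bounded ring that always holds the most recent events could look like this (a sketch; the types are hypothetical):

package output

import "sync"

type Event map[string]any

// Ring holds the most recent events, overwriting the oldest when full.
type Ring struct {
	mu     sync.Mutex
	events []Event
	next   int
	full   bool
}

func NewRing(size int) *Ring {
	return &Ring{events: make([]Event, size)}
}

func (r *Ring) Add(e Event) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.events[r.next] = e
	r.next = (r.next + 1) % len(r.events)
	if r.next == 0 {
		r.full = true
	}
}

// Recent returns buffered events, oldest first.
func (r *Ring) Recent() []Event {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.full {
		return append([]Event(nil), r.events[:r.next]...)
	}
	return append(append([]Event(nil), r.events[r.next:]...), r.events[:r.next]...)
}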

Use Case

This can be useful for constructing a simple web interface that shows a not-quite-realtime view of logs.

If paired with a pipeline that does no filtering, this could also be useful as a way to peek at the recent past and see what the raw events looked like.

Scheduled inputs

Some inputs don't run continuously; they run on a schedule or on demand or once at startup.

For example, an input that periodically polls an HTTP API. Probably want to use a channel to trigger the input. Then the channel could be fed by a recurring cron schedule.

Note: it is impossible to run a task every 14 days using cron, so other types of recurrence schedules should be supported.

Or even more interesting, loglang could provide an API that allows on-demand triggering of scheduled inputs. For example, a pipeline that reads from a dead letter queue would only be triggered on-demand.

Heartbeat should use this approach too.
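A sketch of the trigger-channel idea: the input blocks on a channel that can be fed by a ticker, a cron library, or an on-demand API call. A time.Ticker covers recurrences cron cannot express, such as every 14 days (function names are invented):

package input

import "time"

// StartSchedule feeds the trigger channel on a fixed interval.
func StartSchedule(every time.Duration, trigger chan<- struct{}) *time.Ticker {
	t := time.NewTicker(every) // e.g. 14 * 24 * time.Hour
	go func() {
		for range t.C {
			trigger <- struct{}{}
		}
	}()
	return t
}

// RunInput polls once per trigger; an on-demand API handler can send on
// the same channel to trigger the input outside the schedule.
func RunInput(trigger <-chan struct{}, poll func()) {
	for range trigger {
		poll()
	}
}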

Input: rss

Why not turn this into an RSS reader?

Framing: dsv

DSV (Delimiter-Separated Values) is mostly known as CSV and TSV for commas and tabs respectively.

Rows of tabular data can be interpreted as events by combining the header row with each value row. But because the header is stored outside the values, this complicates the framing pattern.
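A sketch using Go's encoding/csv, which already supports arbitrary single-rune delimiters (ReadDSV is an invented name):

package framing

import (
	"encoding/csv"
	"io"
)

// ReadDSV reads the header row once, then zips it with each value row
// to produce one event per row. Use ',' for CSV or '\t' for TSV.
func ReadDSV(r io.Reader, delim rune) ([]map[string]string, error) {
	cr := csv.NewReader(r)
	cr.Comma = delim
	header, err := cr.Read()
	if err != nil {
		return nil, err
	}
	var events []map[string]string
	for {
		row, err := cr.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		e := make(map[string]string, len(header))
		for i, col := range header {
			if i < len(row) {
				e[col] = row[i]
			}
		}
		events = append(events, e)
	}
	return events, nil
}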

decoder for Prometheus text-based metric format

The Prometheus metrics format looks like this:

http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320

Reference: https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md#exposition-formats

In combination with an HTTP fetch input type, this could be used to generate events from Prometheus-capable endpoints.
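A rough sketch of decoding one sample line; label sets, escaping, timestamps, and comments from the full exposition format are omitted:

package codec

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSample decodes a line like "http_request_duration_seconds_sum 53423".
// The name may still carry a {label="..."} set, left unparsed here; label
// values containing spaces would break this naive split.
func parseSample(line string) (name string, value float64, err error) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return "", 0, fmt.Errorf("not a sample line")
	}
	fields := strings.Fields(line)
	if len(fields) < 2 {
		return "", 0, fmt.Errorf("malformed sample: %q", line)
	}
	value, err = strconv.ParseFloat(fields[1], 64) // also accepts "+Inf"
	return fields[0], value, err
}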

Input: stdin

It should be possible to read events from the process's standard input.

loglang should exit with status 0 when standard input is closed.

No E2E acknowledgement is needed.

Remember to populate ECS schema fields like hostname.
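A minimal sketch of that behaviour (emitting events into the pipeline is stubbed out with a print):

package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	hostname, _ := os.Hostname() // ECS host.name for each event
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		// In loglang this would go through the pipeline; printing stands in here.
		fmt.Printf("host=%s message=%q\n", hostname, scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	os.Exit(0) // stdin closed: exit cleanly; no E2E ack needed
}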

Output: tcp_stream

  • write events to a tcp stream
  • connect to arbitrary {ip, port}
  • should support looking up hostname
    • if hostname lookup fails (after a few tries) end the output
  • should use SetNoDelay, but also be smart about internal buffering strategy
  • option to bin-pack so that each TCP packet contains whole events
  • if remote peer closes the stream, reconnect (up to a limit/timeout)
    • make sure to look up the hostname again when reconnecting, in case DNS has changed
  • ignore anything sent back to us over TCP (maybe call CloseRead?)
  • how to handle keepalive?
  • should the stream be torn down when there are no events?

This should be easy to implement.
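A sketch of the dial step covering the SetNoDelay, CloseRead, and re-resolve points above (retry limits and buffering are omitted):

package output

import (
	"net"
	"time"
)

// dial re-resolves the hostname on every attempt (DNS may have changed),
// enables TCP_NODELAY, and half-closes the read side since replies are ignored.
func dial(host, port string) (*net.TCPConn, error) {
	conn, err := net.DialTimeout("tcp", net.JoinHostPort(host, port), 10*time.Second)
	if err != nil {
		return nil, err // caller retries a few times, then ends the output
	}
	tcp := conn.(*net.TCPConn)
	tcp.SetNoDelay(true) // but batch writes so each packet holds whole events
	tcp.CloseRead()      // ignore anything the peer sends back
	return tcp, nil
}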

Output: udp

Try to respect a batching strategy, while respecting that the maximum UDP datagram over IPv4 is 65,515 bytes, of which 65,507 bytes are payload after the 8-byte UDP header.

This should be easy to implement.
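A sketch of bin-packing whole events into datagrams under the payload limit (single events larger than the limit would need splitting or dropping, omitted here):

package output

import "net"

const maxPayload = 65507 // IPv4 payload limit noted above

// sendBatch packs whole events into each datagram, flushing before the
// next event would overflow.
func sendBatch(conn *net.UDPConn, events [][]byte) error {
	var buf []byte
	for _, e := range events {
		if len(buf) > 0 && len(buf)+len(e) > maxPayload {
			if _, err := conn.Write(buf); err != nil {
				return err
			}
			buf = buf[:0]
		}
		buf = append(buf, e...)
	}
	if len(buf) == 0 {
		return nil
	}
	_, err := conn.Write(buf)
	return err
}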

Input: exec

  • exec() a local process
  • read both stdout and stderr
  • populate ECS Schema process fields, especially process exit code
  • should provide default environment that is minimal but indicates invocation from loglang (keep PATH?)
  • should support arguments to the process
  • should support shell out too?
  • no need to send anything to stdin of launched process
  • should support scheduling to re-run periodically
  • if not scheduled, loglang should exit with status 0 but only if exec is the only input
  • what to use for default working directory? a tmp dir that gets cleaned up by loglang?
  • is there any reason to support parallelism?
  • provide NO_COLOR in the environment by default

Use cases:

  • scraping of various kinds
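A sketch of a single run, covering the minimal environment, separate stdout/stderr capture, and the exit code for the ECS process fields (LOGLANG=1 is a hypothetical invocation marker):

package input

import (
	"bytes"
	"os"
	"os/exec"
)

// runOnce executes the process with a minimal environment and captures
// stdout and stderr separately, plus the exit code.
func runOnce(name string, args ...string) (stdout, stderr []byte, exitCode int, err error) {
	cmd := exec.Command(name, args...)
	cmd.Env = []string{
		"PATH=" + os.Getenv("PATH"), // keep PATH
		"LOGLANG=1",                 // hypothetical marker: invoked from loglang
		"NO_COLOR=1",                // ask tools to skip ANSI color codes
	}
	var out, errOut bytes.Buffer
	cmd.Stdout = &out
	cmd.Stderr = &errOut
	err = cmd.Run()
	exitCode = -1
	if cmd.ProcessState != nil {
		exitCode = cmd.ProcessState.ExitCode()
	}
	return out.Bytes(), errOut.Bytes(), exitCode, err
}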

@metadata

Maybe events should have a separate store of metadata that doesn't get sent by outputs.

Logstash uses the @metadata field, but a separate field in the Event struct would work just as well.
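A hypothetical layout:

package loglang

// Fields is what outputs serialize; Meta travels with the event through
// the pipeline but is never written by any output.
type Event struct {
	Fields map[string]any
	Meta   map[string]any
}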

Input: git

Git is super interesting! New branches, new commits, new tags can all be interpreted as events. The reflog (reference log) will probably be important here.

This could be very interesting:

  • Git -> filters -> Slack

Output: File

  • output to a file, pipe, or socket
  • should be usable for output to systemd log device
  • support output batches for writing to files with different names
  • how to handle naming pattern? for example, organizing by /year/month/day/hour? or log.1, log.2, log.3, ...
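For the time-based idea, a one-function sketch using Go's reference-time layout (pathFor and the file name are invented):

package output

import (
	"path/filepath"
	"time"
)

// pathFor organizes output by /year/month/day/hour.
func pathFor(base string, t time.Time) string {
	return filepath.Join(base, t.Format("2006/01/02/15"), "events.log")
}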

Input: relp

RELP (the Reliable Event Logging Protocol) was proposed by Rainer Gerhards, the lead developer of rsyslog, in 2008.

Compared to plain syslog, RELP allows the receiver to send acknowledgements confirming that each message was received.

Reference

Specification: https://github.com/rsyslog/librelp/blob/master/doc/relp.html
Mailing List: https://lists.adiscon.net/mailman/listinfo/relp (requires membership)
Implementation: https://github.com/rsyslog/librelp

Protocol Characteristics

  • Text-based protocol
  • Transport: always TCP
  • Framing: content-length header (plus a bit extra to support acks)
  • Codec: syslog

Pipelining is a key feature (the client can send multiple requests without waiting for the first response). Responses must be sent by the server in exactly the same order as the commands were received.

Version 1.1 adds support for TLS using STARTTLS.
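Per my reading of the spec above, each frame is TXNR SP COMMAND SP DATALEN, optionally followed by SP DATA, then an LF trailer. A parsing sketch over an already-captured frame (a streaming reader would parse the header first to learn how many data bytes to consume):

package input

import (
	"bytes"
	"fmt"
	"strconv"
)

// parseFrame parses "TXNR SP COMMAND SP DATALEN [SP DATA]" with the LF
// trailer still attached when there is no data.
func parseFrame(frame []byte) (txnr int, command string, data []byte, err error) {
	parts := bytes.SplitN(frame, []byte(" "), 4)
	if len(parts) < 3 {
		return 0, "", nil, fmt.Errorf("short frame")
	}
	if txnr, err = strconv.Atoi(string(parts[0])); err != nil {
		return
	}
	command = string(parts[1]) // e.g. "open", "syslog", "rsp", "close"
	dlen, err := strconv.Atoi(string(bytes.TrimRight(parts[2], "\n")))
	if err != nil {
		return
	}
	if dlen > 0 && len(parts) == 4 && len(parts[3]) >= dlen {
		data = parts[3][:dlen]
	}
	return
}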

Output: redis

Redis is very cool and it would be great to support it.

But not as part of the core suite; we should use an existing Redis module for this.

Modes

Input: stomp

STOMP (Streaming Text Orientated Messaging Protocol) provides an interoperable wire format so that STOMP clients can communicate with any STOMP message broker to provide easy and widespread messaging interoperability among many languages, platforms and brokers.

Reference

https://stomp.github.io

Protocol Characteristics

  • Transport: TCP
  • Framing: sometimes null delimiter, sometimes content-length
  • Codec: very custom(?)
  • Integrity: acknowledgement with receipt frames, transaction commit and rollback
  • Authentication: username/password
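A sketch of the framing: a command line, header lines, a blank line, then a body terminated by a NUL byte (the content-length header, when present, overrides NUL-scanning and is omitted here):

package codec

import (
	"fmt"
	"strings"
)

// parseFrame splits "COMMAND\nheader:value\n...\n\nbody\x00".
func parseFrame(raw string) (command string, headers map[string]string, body string, err error) {
	raw = strings.TrimSuffix(raw, "\x00")
	head, body, found := strings.Cut(raw, "\n\n")
	if !found {
		return "", nil, "", fmt.Errorf("no blank line between headers and body")
	}
	lines := strings.Split(head, "\n")
	command = lines[0] // e.g. SEND, SUBSCRIBE, MESSAGE, RECEIPT
	headers = make(map[string]string)
	for _, h := range lines[1:] {
		if k, v, ok := strings.Cut(h, ":"); ok {
			headers[k] = v
		}
	}
	return command, headers, body, nil
}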

Providing guarantees about output field types

Elasticsearch is strict about field types within a given index. If you try to add two documents to the same index, like this:

{"status": 200}
{"status": "OK"}

Elasticsearch will refuse to index the second document, and if you're using Logstash that failure is silent. 😱

There are several things that Loglang could do to prepare output for Elasticsearch:

  • automatic coercion of field types to the first seen type
  • automatic coercion of field types to a fixed schema (JSON Schema)
  • send failed events to a dead letter queue
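A sketch of the first option; when a later value's type disagrees with the first seen type, this crude version falls back to stringifying it (a dead letter queue is the safer choice for anything it cannot coerce):

package output

import "fmt"

// Coercer remembers the Go type of the first value seen per field and
// stringifies later values whose type disagrees.
type Coercer struct {
	seen map[string]string // field name -> first seen type
}

func (c *Coercer) Coerce(doc map[string]any) {
	if c.seen == nil {
		c.seen = make(map[string]string)
	}
	for field, v := range doc {
		t := fmt.Sprintf("%T", v)
		first, ok := c.seen[field]
		if !ok {
			c.seen[field] = t
			continue
		}
		if t != first {
			doc[field] = fmt.Sprint(v) // crude fallback: send it as a string
		}
	}
}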

Output: exec

  • exec() a local process
  • send events to stdin of exec'ed process
  • what to do with stdout/stderr from that process? drop, but write warning. (use exec input instead)
  • populate ECS Schema process fields, especially process exit code
  • should provide default environment that is minimal but indicates invocation from loglang (keep PATH?)
  • should support arguments to the process
  • should support shell out too?
  • what to use for default working directory? a tmp dir that gets cleaned up by loglang?
  • probably nice to support multiple exec in parallel up to some limit

Input: redis

Redis is very cool and it would be great to support it.

But not as part of the core suite; we should use an existing Redis module for this.

Modes

  • queue/list using LPOP/RPOP
    • alternating between LPOP and RPOP has nice characteristics during overload scenarios. but this behaviour should be configurable (LPOP, RPOP, or alternating)
  • pub/sub using SUBSCRIBE
  • set using SPOP
  • stream using XREAD

Input: unix_socket

Unix domain sockets cannot be read like regular files, so a special input plugin is needed.

Requirements

  • support two socket modes (listen vs. connect)
  • populate ECS schema fields, especially host.name and file.path (even though it's not really a file) and network.transport = uds (unix domain socket)
  • end input if the socket is unavailable (no retry?)
  • unix domain sockets can be either byte stream (no framing) or datagram (framing)!

Description

Unix sockets are reliable. If the reader doesn't read, the writer blocks. If the socket is a datagram socket, each write is paired with a read. If the socket is a stream socket, the kernel may buffer some bytes between the writer and the reader, but when the buffer is full, the writer will block. Data is never discarded, except for buffered data if the reader closes the connection before reading the buffer.
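A sketch of listen mode for both socket types:

package input

import "net"

// "unix" is a byte stream (framing required); "unixgram" is datagrams
// (each write pairs with one read, so framing comes for free).
func listenStream(path string) (net.Listener, error) {
	return net.Listen("unix", path) // e.g. /run/loglang/loglang.sock
}

func listenDatagram(path string) (net.PacketConn, error) {
	return net.ListenPacket("unixgram", path)
}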

Motivation

Unix domain sockets are used by traditional syslog and systemd. Supporting socket input would enable direct replacement of rsyslogd.

The GNU C Library provides functions to submit messages to Syslog. It does this by writing to the /dev/log socket. See Submitting Syslog Messages.

Source: https://www.gnu.org/software/libc/manual/html_node/Overview-of-Syslog.html

~ $ ls -lac /dev/log /run/systemd/journal/dev-log
lrwxrwxrwx 1 root root 28 Dec 22  2022 /dev/log -> /run/systemd/journal/dev-log
srw-rw-rw- 1 root root  0 Dec 22  2022 /run/systemd/journal/dev-log

Apparently Docker also uses unix sockets.

Tips

Run netstat -a -p --unix to see all unix sockets on the local system.

Use socat or hookah for development testing.

Input: File

  • tail or read whole file
    • tail mode should keep track of the file position between process restarts. Use the local filesystem for saving a bookmark (see the sketch after this list).
    • whole file mode should support scheduling
  • single file, or glob path
  • support ECS Schema attributes for files (path, etc.)
  • save bookmark as offset in file, or based on the content of a field that expresses a total order?
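The bookmark sketch mentioned above: persist the byte offset in a sidecar file and seek there on restart (rotation detection and fsync are omitted):

package input

import (
	"io"
	"os"
	"strconv"
)

// openAtBookmark opens the file and seeks to the saved offset, if any.
func openAtBookmark(path, bookmarkPath string) (*os.File, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	if b, err := os.ReadFile(bookmarkPath); err == nil {
		if off, err := strconv.ParseInt(string(b), 10, 64); err == nil {
			f.Seek(off, io.SeekStart)
		}
	}
	return f, nil
}

// saveBookmark persists the current offset after each read.
func saveBookmark(bookmarkPath string, offset int64) error {
	return os.WriteFile(bookmarkPath, []byte(strconv.FormatInt(offset, 10)), 0o644)
}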

preserve original event

To support the ECS schema field event.original, it might be worth storing the original bytes (after framing, before the codec) and providing an option to automatically include that on each output.
