cloudurable / cassandra-image

Cassandra Image using Packer for Docker and EC2 AMI. Covers managing EC2 Cassandra clusters with Ansible.
test docker image
turn off GC logs by default
create repeater for CloudWatch from logback/JSON format
[Unit]
Description=Cassandra Cloud
[Service]
Type=oneshot
ExecStart=/opt/cassandra/bin/cassandra-cloud
TimeoutSec=0
StandardOutput=tty
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target

(TODO: add cassandra as a target)
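A sketch of installing and enabling the unit above. The service name `cassandra-cloud.service` is an assumption based on the ExecStart path; the file is staged in the working directory here so it can be reviewed before copying into place.

```shell
# Stage the unit file locally first; on a real host it goes to
# /etc/systemd/system/cassandra-cloud.service.
cat > cassandra-cloud.service <<'EOF'
[Unit]
Description=Cassandra Cloud

[Service]
Type=oneshot
ExecStart=/opt/cassandra/bin/cassandra-cloud
TimeoutSec=0
StandardOutput=tty
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

# On the target host (requires root):
#   sudo cp cassandra-cloud.service /etc/systemd/system/
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now cassandra-cloud
```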
Exception (org.apache.cassandra.exceptions.ConfigurationException) encountered during startup: The hsha rpc_server_type is not compatible with an rpc_max_threads setting of 'unlimited'. Please see the comments in cassandra.yaml for rpc_server_type and rpc_max_threads.
enable JMX remoting
setup ansible
Linux
sysctl.conf
From ....
https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
Compare to what we have
The primary interface for tuning the Linux kernel is the /proc virtual filesystem. In recent years, the /sys filesystem has expanded on what /proc does. Shell scripts using echo and procedural code are difficult to manage automatically, as we all know from handling cassandra-env.sh. This led the distros to create /etc/sysctl.conf, and in modern distros, /etc/sysctl.conf.d. The sysctl command reads the files in /etc and applies the settings to the kernel in a consistent, declarative fashion.
The following is a block of settings I use on almost every server I touch. Most of these are safe to apply live and should require little tweaking from site to site. Note: I have NOT tested these extensively with multi-DC, but most of them should be safe. Those items that may need extra testing for multi-DC have comments indicating it.
http://tobert.github.io/post/2014-06-24-linux-defaults.html
# The first set of settings is intended to open up the network stack performance by
# raising memory limits and adjusting features for high-bandwidth/low-latency
# networks.
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.netdev_max_backlog = 2500
net.core.somaxconn = 65000
net.ipv4.tcp_ecn = 0
net.ipv4.tcp_window_scaling = 1
net.ipv4.ip_local_port_range = 10000 65535
# this block is designed for and only known to work in a single physical DC
# TODO: validate on multi-DC and provide alternatives
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_fack = 1
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_orphan_retries = 1
# significantly reduce the amount of data the kernel is allowed to store
# in memory between fsyncs
# dirty_background_bytes says how many bytes can be dirty before the kernel
# starts flushing in the background. Set this as low as you can get away with.
# It is basically equivalent to trickle_fsync=true but more efficient since the
# kernel is doing it. Older kernels will need to use vm.dirty_background_ratio
# instead.
vm.dirty_background_bytes = 10485760
# Same deal as dirty_background but the whole system will pause when this threshold
# is hit, so be generous and set this to a much higher value, e.g. 1GB.
# Older kernels will need to use dirty_ratio instead.
vm.dirty_bytes = 1073741824
# disable zone reclaim for IO-heavy applications like Cassandra
vm.zone_reclaim_mode = 0
# there is no good reason to limit these on server systems, so set them
# to 2^31 to avoid any issues
# Very large values in max_map_count may cause instability in some kernels.
fs.file-max = 1073741824
vm.max_map_count = 1073741824
# only swap if absolutely necessary
# some kernels have trouble with 0 as a value, so stick with 1
vm.swappiness = 1
On vm.max_map_count:
http://linux-kernel.2935.n7.nabble.com/Programs-die-when-max-map-count-is-too-large-td317670.html
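As the quoted guide says, the declarative way to manage these settings is a sysctl drop-in file. A minimal sketch, staging the file in the working directory for review; the name `99-cassandra.conf` is an arbitrary choice, and only a subset of the values above is shown.

```shell
# Write a subset of the settings above to a sysctl drop-in file.
# On a real host this file belongs in /etc/sysctl.d/99-cassandra.conf.
cat > 99-cassandra.conf <<'EOF'
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.somaxconn = 65000
vm.dirty_background_bytes = 10485760
vm.dirty_bytes = 1073741824
vm.zone_reclaim_mode = 0
vm.swappiness = 1
EOF

# Apply (requires root):
#   sudo cp 99-cassandra.conf /etc/sysctl.d/
#   sudo sysctl --system        # reloads all sysctl config files
# Spot-check a single value afterwards:
#   sysctl vm.swappiness
```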
limits.conf (pam_limits)
The DSE and DSC packages install an /etc/security/limits.d/ file by default that should remove most of the problems around pam_limits(8). Single-user systems such as database servers have little use for these limitations, so I often turn them off globally using the following in /etc/security/limits.conf. Some users may already be customizing this file, in which case change all of the asterisks to cassandra, or whatever user DSE/Cassandra is running as.
* - nofile 1000000
* - memlock unlimited
* - fsize unlimited
* - data unlimited
* - rss unlimited
* - stack unlimited
* - cpu unlimited
* - nproc unlimited
* - as unlimited
* - locks unlimited
* - sigpending unlimited
* - msgqueue unlimited
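pam_limits only applies at session start, so after editing limits.conf, verify the values in a fresh login shell. A quick check:

```shell
# Limits in effect for the current shell.
ulimit -n    # nofile: should report 1000000 after the change above
ulimit -l    # memlock: should report unlimited

# The kernel's view of limits for any process, here the current shell:
grep -E 'open files|locked memory' /proc/self/limits
```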
Ensure cassandra uses jemalloc
test ansible. create user. finish article.
There are many settings in the conf file that can be optimized based on the ergonomics of the deployment environment.
Number of threads based on number of vCPUs.
Type of GC based on size of AMI instances memory.
The default garbage collector is changed from Concurrent-Mark-Sweep (CMS) to G1. G1 performs better for nodes with heap sizes of 4GB or greater.
create systemd service for cloudwatch log metrics (ec2 only)
chrt
The Linux kernel's default policy for new processes is SCHED_OTHER. The SCHED_OTHER policy is designed to make interactive tasks such as X windows and audio/video playback work well. This means the scheduler assigns tasks very short time slices on the CPU so that other tasks that may need immediate service can get time. This is great for watching cat videos on Youtube, but not so great for a database, where interactive response is on a scale of milliseconds rather than microseconds. Furthermore, Cassandra's threads park themselves properly. Setting the scheduling policy to SCHED_BATCH seems more appropriate and can open up a little more throughput. I don't have good numbers on this yet, but observations of dstat on a few clusters have convinced me it's useful and doesn't impact client latency.
chrt --batch 0 $COMMAND
chrt --batch 0 --all-tasks --pid $PID
You can inject these into cassandra-env.sh or /etc/{default,sysconfig}/{dse,cassandra} by using the $$ variable that returns the current shell's pid. Child processes inherit scheduling policies, so if you set the startup shell's policy, the JVM will inherit it. Just add this line to one of those files:
chrt --batch 0 --pid $$
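To confirm the policy took effect, chrt can also report the current policy of a running process. A quick check against the current shell (on a live node you would pass the JVM's pid instead):

```shell
# Report the scheduling policy of the current shell.
chrt --pid $$

# Before the change this prints SCHED_OTHER; after
# `chrt --batch 0 --pid $$` the same command reports SCHED_BATCH.
```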
install OS config for cassandra
create systemd service for cloudwatch log repeater (ec2 only)
Disk readahead

Disk readahead boosts sequential access by reading a little more data than requested, ahead of time, to mitigate the effects of slow disk reads. This means less frequent requests to the disk. But this function has its disadvantages as well. If your system is performing high-frequency random reads and writes, a high RA value magnifies them into far more I/O than is actually needed, which slows down the system. (It also fills memory with data that you do not actually need.) To view the current value of RA, execute blockdev --report as shown in the following command line: $ sudo blockdev --report
Neeraj, Nishant (2015-03-26). Mastering Apache Cassandra - Second Edition (Kindle Locations 2799-2806). Packt Publishing. Kindle Edition.
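blockdev --report needs root to open the raw devices; the same per-device readahead is exposed world-readable under /sys. A sketch for inspecting it without root, plus the root command to change it (note blockdev --setra counts 512-byte sectors, not KB):

```shell
# Readahead per block device, in KB, readable without root.
for q in /sys/block/*/queue/read_ahead_kb; do
  [ -e "$q" ] || continue   # skip if no block devices are visible (e.g. containers)
  dev=${q#/sys/block/}; dev=${dev%/queue/read_ahead_kb}
  printf '%s: %s KB\n' "$dev" "$(cat "$q")"
done

# Set readahead (requires root). --setra takes 512-byte sectors,
# so 16 sectors == 8 KB:
#   sudo blockdev --setra 16 /dev/sda
```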
Thanks for providing this useful container! I have a question about two Cassandra configs: enable_materialized_views and enable_transient_replication. The official documentation does not recommend enabling them in production, but they are enabled by default here. Shall we disable them?
Thanks!
ansible aws-node0 -m ping
aws-node0 | UNREACHABLE! => {
"changed": false,
"msg": "Failed to connect to the host via ssh.",
"unreachable": true
}
[nodes]
node0 ansible_user=vagrant
node1 ansible_user=vagrant
node2 ansible_user=vagrant
#node3
#node4
[aws-nodes]
aws-node0 ansible_user=cassandra
$ cat /etc/hosts
...
### Used for ansible vagrant
192.168.50.20 bastion
192.168.50.4 node0
192.168.50.5 node1
192.168.50.6 node2
192.168.50.7 node3
192.168.50.8 node4
192.168.50.9 node5
54.202.53.234 aws-node0
Testing: use cassandra cloud to run a cluster via vagrant
/opt/cloudurable/bin/systemd-cloud-watch
main ERROR: 2017/02/27 00:53:03 main.go:39: Usage: systemd-cloud-watch <config-file>
-help
set to true to show this help
config file name must be set!
Filesystems & Mount options & other urban legends
Cassandra relies on a standard filesystem for storage. The choice of filesystem and how it's configured can have a large impact on performance.
One common performance option that I find amusing is the noatime option. It used to bring large gains in performance by avoiding the need to write to inodes every time a file is accessed. Many years ago, the Linux kernel changed the default atime behavior from synchronous to what is called relatime which means the kernel will batch atime updates in memory for a while and update inodes only periodically. This removes most of the performance overhead of atime, making the noatime tweak obsolete.
Another option I've seen abused a few times is the barrier/nobarrier flag. A filesystem barrier is a transaction marker that filesystems use to tell underlying devices which IOs need to be committed together to achieve consistency. Barriers may be disabled on Cassandra systems to get better disk throughput, but this should NOT be done without full understanding of what it means. Without barriers in place, filesystems may come back from a power failure with missing or corrupt data, so please read the mount(8) man page first and proceed with caution.
Install jemalloc for CentOS (the CentOS/EPEL package is jemalloc; libjemalloc1 is the Debian/Ubuntu package name)
EC2 setup - ansible
Create Vagrant Image
Use Cassandra 3 instead of Cassandra 2.
change the password of the default cassandra user
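A sketch of the password change with cqlsh. `StrongPassHere` is a placeholder; this assumes password authentication (authenticator: PasswordAuthenticator) is enabled and the default cassandra/cassandra credentials still work. The block builds and prints the statement; the live invocation is shown in the comment.

```shell
# The CQL statement; 'StrongPassHere' is a placeholder password.
CQL="ALTER USER cassandra WITH PASSWORD 'StrongPassHere';"
echo "$CQL"

# Run it against a live node:
#   cqlsh -u cassandra -p cassandra -e "$CQL"
```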
[root@1c8b8ef6c9d4 /]# OpenJDK 64-Bit Server VM warning: Cannot open file /opt/cassandra/bin/../logs/gc.log due to No such file or directory
change image to use xfs
/opt/cloudurable/bin/metricsd
INFO : [main] - 2017/02/27 00:50:29 config.go:30: Loading config /etc/metricsd.conf
panic: open /etc/metricsd.conf: no such file or directory
goroutine 1 [running]:
panic(0x7009a0, 0xc420101200)
/usr/local/go/src/runtime/panic.go:500 +0x1a1
main.main()
/gopath/src/github.com/advantageous/metricsd/main.go:18 +0x335
disable swapping
WARN [main] 2017-02-14 08:21:18,157 CLibrary.java:163 - Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out, especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root.
WARN [main] 2017-02-14 08:21:18,158 StartupChecks.java:193 - OpenJDK is not recommended. Please upgrade to the newest Oracle Java release
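The ENOMEM warning above means the memlock rlimit is too low for mlockall. A quick, non-destructive check of the swap state, with the root-only fixes shown as comments:

```shell
# Is any swap device active? Only the header line in the output means none.
cat /proc/swaps

# Fixes (require root): disable swap immediately and persistently, and
# raise the memlock limit for the user Cassandra runs as:
#   sudo swapoff -a        # then remove swap entries from /etc/fstab
#   echo 'cassandra - memlock unlimited' | \
#     sudo tee /etc/security/limits.d/cassandra-memlock.conf
```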
change logging to use... logstash encoder
https://github.com/logstash/logstash-logback-encoder
Then... repeat that to cloudwatch.
streaming
Make sure to always set streaming_socket_timeout_in_ms to a non-zero value. 1 hour is a conservative choice that will prevent the worst behavior.
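A sketch of enforcing that setting with sed. The block works on a sample copy it creates so it is self-contained; on a node, point YAML at the real cassandra.yaml (the path varies by install, commonly /etc/cassandra/cassandra.yaml).

```shell
# Work on a local sample copy; set YAML to the real cassandra.yaml on a node.
YAML=./cassandra.yaml
cat > "$YAML" <<'EOF'
# ... other settings elided ...
streaming_socket_timeout_in_ms: 0
EOF

# 3600000 ms == 1 hour, the conservative choice mentioned above.
sed -i 's/^streaming_socket_timeout_in_ms:.*/streaming_socket_timeout_in_ms: 3600000/' "$YAML"

grep streaming_socket_timeout_in_ms "$YAML"
```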
From https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
When getting acquainted with a new machine, one of the first things to do is discover what kind of storage is installed. Here are some handy commands:
blockdev --report
fdisk -l
ls -l /dev/disk/by-id
lspci -v # pciutils
sg_inq /dev/sda # sg3-utils
ls /sys/block
Folks spend a lot of time worrying about tuning SSDs, and that's great, but on modern kernels these things usually only make a few % difference at best. That said, start with these settings as a default and tune from there.
When in doubt, always use the deadline IO scheduler. The default IO scheduler is CFQ, which stands for "Completely Fair Queueing". This is the only elevator that supports IO prioritization via cgroups, so if Docker or some other reason for cgroups is in play, stick with CFQ. In some cases it makes sense to use the noop scheduler, such as in VMs and on hardware RAID controllers, but the difference between noop and deadline is small enough that I only ever use deadline. Some VM-optimized kernels are hard-coded to only have noop and that's fine.
echo 1 > /sys/block/sda/queue/nomerges # SSD only! 0 on HDD
echo 8 > /sys/block/sda/queue/read_ahead_kb # up to 128, no higher
echo deadline > /sys/block/sda/queue/scheduler
I usually start with read_ahead_kb at 8 on SSDs and 64 on hard drives (to line up with Cassandra <= 2.2's sstable block size); this applies with mmap IO in <= 2.2 and to all configurations >= 3.0. Setting readahead to 0 is fine on many configurations but has caused problems on older kernels, making 8 a safe choice that doesn't hurt latency.
Beware: setting readahead very high (e.g. 512K) can look impressive from the system side by driving high IOPS on the storage while the client latency degrades because the drives are busy doing wasted IO. Don't ask me how I know this without buying me a drink first.
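The echo-into-/sys settings above do not survive a reboot. One way to persist them is a udev rule; a sketch, staged locally for review — the rules-file name is arbitrary, and it assumes non-rotational sd* devices are the SSDs:

```shell
# Stage a udev rule that applies the scheduler/readahead settings at device
# discovery. On a real host it goes in /etc/udev/rules.d/60-cassandra-io.rules.
cat > 60-cassandra-io.rules <<'EOF'
# SSDs (non-rotational): deadline scheduler, small readahead, no merges
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", \
  ATTR{queue/scheduler}="deadline", ATTR{queue/read_ahead_kb}="8", \
  ATTR{queue/nomerges}="1"
# HDDs (rotational): deadline scheduler, modest readahead
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", \
  ATTR{queue/scheduler}="deadline", ATTR{queue/read_ahead_kb}="64"
EOF

# Load without rebooting (requires root):
#   sudo cp 60-cassandra-io.rules /etc/udev/rules.d/
#   sudo udevadm control --reload
#   sudo udevadm trigger --subsystem-match=block
```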
validate download by checking checksum of file
We need to do this for any binary that we are downloading.
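The usual pattern is sha256sum -c, demonstrated here on a stand-in file so the example is self-contained; for a real download, the .sha256 file comes from the project's published checksums, not generated locally.

```shell
# Stand-in for a downloaded release tarball.
echo "pretend this is a release tarball" > apache-cassandra.tar.gz

# Normally this digest file is downloaded alongside the tarball;
# generated here only to make the example runnable.
sha256sum apache-cassandra.tar.gz > apache-cassandra.tar.gz.sha256

# Verification step to run after every download; exits nonzero on mismatch.
sha256sum -c apache-cassandra.tar.gz.sha256
```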
install cassandra as systemd service