cloudurable / cassandra-image

Cassandra Image using Packer for Docker and EC2 AMI. Covers managing EC2 Cassandra clusters with Ansible.
test docker image
turn off GC logs by default
create repeater for CloudWatch from logback/JSON format
[Unit]
Description=Cassandra Cloud
[Service]
Type=oneshot
ExecStart=/opt/cassandra/bin/cassandra-cloud
TimeoutSec=0
StandardOutput=tty
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target

(TODO: add cassandra as a target)
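A sketch of installing and enabling the unit above. The service name `cassandra-cloud.service` is an assumption based on the ExecStart path; the file is staged in the working directory here so it can be reviewed before copying into place.

```shell
# Stage the unit file locally first; on a real host it goes to
# /etc/systemd/system/cassandra-cloud.service.
cat > cassandra-cloud.service <<'EOF'
[Unit]
Description=Cassandra Cloud

[Service]
Type=oneshot
ExecStart=/opt/cassandra/bin/cassandra-cloud
TimeoutSec=0
StandardOutput=tty
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

# On the target host (requires root):
#   sudo cp cassandra-cloud.service /etc/systemd/system/
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now cassandra-cloud
```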
Exception (org.apache.cassandra.exceptions.ConfigurationException) encountered during startup: The hsha rpc_server_type is not compatible with an rpc_max_threads setting of 'unlimited'. Please see the comments in cassandra.yaml for rpc_server_type and rpc_max_threads.
enable JMX remoting
setup ansible
Linux
sysctl.conf
From ....
https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
Compare to what we have
The primary interface for tuning the Linux kernel is the /proc virtual filesystem. In recent years, the /sys filesystem has expanded on what /proc does. Shell scripts using echo and procedural code are difficult to manage automatically, as we all know from handling cassandra-env.sh. This led the distros to create /etc/sysctl.conf, and in modern distros, /etc/sysctl.conf.d. The sysctl command reads the files in /etc and applies the settings to the kernel in a consistent, declarative fashion.
The following is a block of settings I use on almost every server I touch. Most of these are safe to apply live and should require little tweaking from site to site. Note: I have NOT tested these extensively with multi-DC, but most of them should be safe. Those items that may need extra testing for multi-DC have comments indicating it.
http://tobert.github.io/post/2014-06-24-linux-defaults.html
# The first set of settings is intended to open up the network stack performance by
# raising memory limits and adjusting features for high-bandwidth/low-latency
# networks.
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.netdev_max_backlog = 2500
net.core.somaxconn = 65000
net.ipv4.tcp_ecn = 0
net.ipv4.tcp_window_scaling = 1
net.ipv4.ip_local_port_range = 10000 65535
# this block is designed for and only known to work in a single physical DC
# TODO: validate on multi-DC and provide alternatives
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_fack = 1
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_orphan_retries = 1
# significantly reduce the amount of data the kernel is allowed to store
# in memory between fsyncs
# dirty_background_bytes says how many bytes can be dirty before the kernel
# starts flushing in the background. Set this as low as you can get away with.
# It is basically equivalent to trickle_fsync=true but more efficient since the
# kernel is doing it. Older kernels will need to use vm.dirty_background_ratio
# instead.
vm.dirty_background_bytes = 10485760
# Same deal as dirty_background but the whole system will pause when this threshold
# is hit, so be generous and set this to a much higher value, e.g. 1GB.
# Older kernels will need to use dirty_ratio instead.
vm.dirty_bytes = 1073741824
# disable zone reclaim for IO-heavy applications like Cassandra
vm.zone_reclaim_mode = 0
# there is no good reason to limit these on server systems, so set them
# to 2^31 to avoid any issues
# Very large values in max_map_count may cause instability in some kernels.
fs.file-max = 1073741824
vm.max_map_count = 1073741824
# only swap if absolutely necessary
# some kernels have trouble with 0 as a value, so stick with 1
vm.swappiness = 1
On vm.max_map_count:
http://linux-kernel.2935.n7.nabble.com/Programs-die-when-max-map-count-is-too-large-td317670.html
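As the quoted guide says, the declarative way to manage these settings is a sysctl drop-in file. A minimal sketch, staging the file in the working directory for review; the name `99-cassandra.conf` is an arbitrary choice, and only a subset of the values above is shown.

```shell
# Write a subset of the settings above to a sysctl drop-in file.
# On a real host this file belongs in /etc/sysctl.d/99-cassandra.conf.
cat > 99-cassandra.conf <<'EOF'
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.somaxconn = 65000
vm.dirty_background_bytes = 10485760
vm.dirty_bytes = 1073741824
vm.zone_reclaim_mode = 0
vm.swappiness = 1
EOF

# Apply (requires root):
#   sudo cp 99-cassandra.conf /etc/sysctl.d/
#   sudo sysctl --system        # reloads all sysctl config files
# Spot-check a single value afterwards:
#   sysctl vm.swappiness
```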
limits.conf (pam_limits)
The DSE and DSC packages install an /etc/security/limits.d/ file by default that should remove most of the problems around pam_limits(8). Single-user systems such as database servers have little use for these limitations, so I often turn them off globally using the following in /etc/security/limits.conf. Some users may already be customizing this file, in which case change all of the asterisks to cassandra, or whatever user DSE/Cassandra is running as.
* - nofile 1000000
* - memlock unlimited
* - fsize unlimited
* - data unlimited
* - rss unlimited
* - stack unlimited
* - cpu unlimited
* - nproc unlimited
* - as unlimited
* - locks unlimited
* - sigpending unlimited
* - msgqueue unlimited
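pam_limits only applies at session start, so after editing limits.conf, verify the values in a fresh login shell. A quick check:

```shell
# Limits in effect for the current shell.
ulimit -n    # nofile: should report 1000000 after the change above
ulimit -l    # memlock: should report unlimited

# The kernel's view of limits for any process, here the current shell:
grep -E 'open files|locked memory' /proc/self/limits
```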
Ensure cassandra uses jemalloc
test ansible. create user. finish article.
There are many settings in the conf file that can be optimized based on the ergonomics of the deployment environment.
Number of threads based on number of vCPUs.
Type of GC based on size of AMI instances memory.
The default garbage collector is changed from Concurrent-Mark-Sweep (CMS) to G1. G1 performs better for nodes with heap sizes of 4GB or greater.
create systemd service for cloudwatch log metrics (ec2 only)
chrt
The Linux kernel's default policy for new processes is SCHED_OTHER. The SCHED_OTHER policy is designed to make interactive tasks such as X windows and audio/video playback work well. This means the scheduler assigns tasks very short time slices on the CPU so that other tasks that may need immediate service can get time. This is great for watching cat videos on Youtube, but not so great for a database, where interactive response is on a scale of milliseconds rather than microseconds. Furthermore, Cassandra's threads park themselves properly. Setting the scheduling policy to SCHED_BATCH seems more appropriate and can open up a little more throughput. I don't have good numbers on this yet, but observations of dstat on a few clusters have convinced me it's useful and doesn't impact client latency.
chrt --batch 0 $COMMAND
chrt --batch 0 --all-tasks --pid $PID
You can inject these into cassandra-env.sh or /etc/{default,sysconfig}/{dse,cassandra} by using the $$ variable that returns the current shell's pid. Child processes inherit scheduling policies, so if you set the startup shell's policy, the JVM will inherit it. Just add this line to one of those files:
chrt --batch 0 --pid $$
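To confirm the policy took effect, chrt can also report the current policy of a running process. A quick check against the current shell (on a live node you would pass the JVM's pid instead):

```shell
# Report the scheduling policy of the current shell.
chrt --pid $$

# Before the change this prints SCHED_OTHER; after
# `chrt --batch 0 --pid $$` the same command reports SCHED_BATCH.
```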
install OS config for cassandra
create systemd service for cloudwatch log repeater (ec2 only)
Disk readahead

Disk readahead boosts sequential access by reading a little more data than requested, ahead of time, to mitigate the effects of slow disk reads. This means less frequent requests to the disk. But this function has its disadvantages as well. If your system is performing high-frequency random reads and writes, a high RA value magnifies them into far more I/O than is actually needed, which slows down the system. (It also fills memory with data that you do not actually need.) To view the current value of RA, execute blockdev --report as shown in the following command line: $ sudo blockdev --report
Neeraj, Nishant (2015-03-26). Mastering Apache Cassandra - Second Edition (Kindle Locations 2799-2806). Packt Publishing. Kindle Edition.
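blockdev --report needs root to open the raw devices; the same per-device readahead is exposed world-readable under /sys. A sketch for inspecting it without root, plus the root command to change it (note blockdev --setra counts 512-byte sectors, not KB):

```shell
# Readahead per block device, in KB, readable without root.
for q in /sys/block/*/queue/read_ahead_kb; do
  [ -e "$q" ] || continue   # skip if no block devices are visible (e.g. containers)
  dev=${q#/sys/block/}; dev=${dev%/queue/read_ahead_kb}
  printf '%s: %s KB\n' "$dev" "$(cat "$q")"
done

# Set readahead (requires root). --setra takes 512-byte sectors,
# so 16 sectors == 8 KB:
#   sudo blockdev --setra 16 /dev/sda
```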
Thanks for providing this useful container! I have a question about two Cassandra configs: enable_materialized_views and enable_transient_replication. The official documentation does not recommend enabling them in production, but they are enabled by default here. Shall we disable them?
Thanks!
ansible aws-node0 -m ping
aws-node0 | UNREACHABLE! => {
"changed": false,
"msg": "Failed to connect to the host via ssh.",
"unreachable": true
}
[nodes]
node0 ansible_user=vagrant
node1 ansible_user=vagrant
node2 ansible_user=vagrant
#node3
#node4
[aws-nodes]
aws-node0 ansible_user=cassandra
$ cat /etc/hosts
...
### Used for ansible vagrant
192.168.50.20 bastion
192.168.50.4 node0
192.168.50.5 node1
192.168.50.6 node2
192.168.50.7 node3
192.168.50.8 node4
192.168.50.9 node5
54.202.53.234 aws-node0
Testing: use cassandra cloud to run a cluster via vagrant
/opt/cloudurable/bin/systemd-cloud-watch
main ERROR: 2017/02/27 00:53:03 main.go:39: Usage: systemd-cloud-watch <config-file>
-help
set to true to show this help
config file name must be set!
Filesystems & Mount options & other urban legends
Cassandra relies on a standard filesystem for storage. The choice of filesystem and how it's configured can have a large impact on performance.
One common performance option that I find amusing is the noatime option. It used to bring large gains in performance by avoiding the need to write to inodes every time a file is accessed. Many years ago, the Linux kernel changed the default atime behavior from synchronous to what is called relatime which means the kernel will batch atime updates in memory for a while and update inodes only periodically. This removes most of the performance overhead of atime, making the noatime tweak obsolete.
Another option I've seen abused a few times is the barrier/nobarrier flag. A filesystem barrier is a transaction marker that filesystems use to tell underlying devices which IOs need to be committed together to achieve consistency. Barriers may be disabled on Cassandra systems to get better disk throughput, but this should NOT be done without full understanding of what it means. Without barriers in place, filesystems may come back from a power failure with missing or corrupt data, so please read the mount(8) man page first and proceed with caution.
Install jemalloc for CentOS (the CentOS/EPEL package is jemalloc; libjemalloc1 is the Debian/Ubuntu package name)
EC2 setup - ansible
Create Vagrant Image
Use Cassandra 3 instead of Cassandra 2.
change the password of the default cassandra user
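A sketch of the password change with cqlsh. `StrongPassHere` is a placeholder; this assumes password authentication (authenticator: PasswordAuthenticator) is enabled and the default cassandra/cassandra credentials still work. The block builds and prints the statement; the live invocation is shown in the comment.

```shell
# The CQL statement; 'StrongPassHere' is a placeholder password.
CQL="ALTER USER cassandra WITH PASSWORD 'StrongPassHere';"
echo "$CQL"

# Run it against a live node:
#   cqlsh -u cassandra -p cassandra -e "$CQL"
```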
[root@1c8b8ef6c9d4 /]# OpenJDK 64-Bit Server VM warning: Cannot open file /opt/cassandra/bin/../logs/gc.log due to No such file or directory
change image to use xfs
/opt/cloudurable/bin/metricsd
INFO : [main] - 2017/02/27 00:50:29 config.go:30: Loading config /etc/metricsd.conf
panic: open /etc/metricsd.conf: no such file or directory
goroutine 1 [running]:
panic(0x7009a0, 0xc420101200)
/usr/local/go/src/runtime/panic.go:500 +0x1a1
main.main()
/gopath/src/github.com/advantageous/metricsd/main.go:18 +0x335
disable swapping
WARN [main] 2017-02-14 08:21:18,157 CLibrary.java:163 - Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out, especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root.
WARN [main] 2017-02-14 08:21:18,158 StartupChecks.java:193 - OpenJDK is not recommended. Please upgrade to the newest Oracle Java release
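The ENOMEM warning above means the memlock rlimit is too low for mlockall. A quick, non-destructive check of the swap state, with the root-only fixes shown as comments:

```shell
# Is any swap device active? Only the header line in the output means none.
cat /proc/swaps

# Fixes (require root): disable swap immediately and persistently, and
# raise the memlock limit for the user Cassandra runs as:
#   sudo swapoff -a        # then remove swap entries from /etc/fstab
#   echo 'cassandra - memlock unlimited' | \
#     sudo tee /etc/security/limits.d/cassandra-memlock.conf
```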
change logging to use... logstash encoder
https://github.com/logstash/logstash-logback-encoder
Then... repeat that to cloudwatch.
streaming
Make sure to always set streaming_socket_timeout_in_ms to a non-zero value. 1 hour is a conservative choice that will prevent the worst behavior.
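A sketch of enforcing that setting with sed. The block works on a sample copy it creates so it is self-contained; on a node, point YAML at the real cassandra.yaml (the path varies by install, commonly /etc/cassandra/cassandra.yaml).

```shell
# Work on a local sample copy; set YAML to the real cassandra.yaml on a node.
YAML=./cassandra.yaml
cat > "$YAML" <<'EOF'
# ... other settings elided ...
streaming_socket_timeout_in_ms: 0
EOF

# 3600000 ms == 1 hour, the conservative choice mentioned above.
sed -i 's/^streaming_socket_timeout_in_ms:.*/streaming_socket_timeout_in_ms: 3600000/' "$YAML"

grep streaming_socket_timeout_in_ms "$YAML"
```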
From https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
When getting acquainted with a new machine, one of the first things to do is discover what kind of storage is installed. Here are some handy commands:
blockdev --report
fdisk -l
ls -l /dev/disk/by-id
lspci -v # pciutils
sg_inq /dev/sda # sg3-utils
ls /sys/block
Folks spend a lot of time worrying about tuning SSDs, and that's great, but on modern kernels these things usually only make a few % difference at best. That said, start with these settings as a default and tune from there.
When in doubt, always use the deadline IO scheduler. The default IO scheduler is CFQ, which stands for "Completely Fair Queueing". This is the only elevator that supports IO prioritization via cgroups, so if Docker or some other reason for cgroups is in play, stick with CFQ. In some cases it makes sense to use the noop scheduler, such as in VMs and on hardware RAID controllers, but the difference between noop and deadline is small enough that I only ever use deadline. Some VM-optimized kernels are hard-coded to only have noop and that's fine.
echo 1 > /sys/block/sda/queue/nomerges # SSD only! 0 on HDD
echo 8 > /sys/block/sda/queue/read_ahead_kb # up to 128, no higher
echo deadline > /sys/block/sda/queue/scheduler
I usually start with read_ahead_kb at 8 on SSDs and 64 on hard drives (to line up with Cassandra <= 2.2's sstable block size); this applies with mmap IO in <= 2.2 and to all configurations >= 3.0. Setting readahead to 0 is fine on many configurations but has caused problems on older kernels, making 8 a safe choice that doesn't hurt latency.
Beware: setting readahead very high (e.g. 512K) can look impressive from the system side by driving high IOPS on the storage while the client latency degrades because the drives are busy doing wasted IO. Don't ask me how I know this without buying me a drink first.
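The echo-into-/sys settings above do not survive a reboot. One way to persist them is a udev rule; a sketch, staged locally for review — the rules-file name is arbitrary, and it assumes non-rotational sd* devices are the SSDs:

```shell
# Stage a udev rule that applies the scheduler/readahead settings at device
# discovery. On a real host it goes in /etc/udev/rules.d/60-cassandra-io.rules.
cat > 60-cassandra-io.rules <<'EOF'
# SSDs (non-rotational): deadline scheduler, small readahead, no merges
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", \
  ATTR{queue/scheduler}="deadline", ATTR{queue/read_ahead_kb}="8", \
  ATTR{queue/nomerges}="1"
# HDDs (rotational): deadline scheduler, modest readahead
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", \
  ATTR{queue/scheduler}="deadline", ATTR{queue/read_ahead_kb}="64"
EOF

# Load without rebooting (requires root):
#   sudo cp 60-cassandra-io.rules /etc/udev/rules.d/
#   sudo udevadm control --reload
#   sudo udevadm trigger --subsystem-match=block
```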
validate download by checking checksum of file
We need to do this for any binary that we are downloading.
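The usual pattern is sha256sum -c, demonstrated here on a stand-in file so the example is self-contained; for a real download, the .sha256 file comes from the project's published checksums, not generated locally.

```shell
# Stand-in for a downloaded release tarball.
echo "pretend this is a release tarball" > apache-cassandra.tar.gz

# Normally this digest file is downloaded alongside the tarball;
# generated here only to make the example runnable.
sha256sum apache-cassandra.tar.gz > apache-cassandra.tar.gz.sha256

# Verification step to run after every download; exits nonzero on mismatch.
sha256sum -c apache-cassandra.tar.gz.sha256
```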
install cassandra as systemd service