m2shad0w / blog


:fire: :clap: :dog: blog

Home Page: https://m2shad0w.com/blog

Shell 0.41% CoffeeScript 0.45% JavaScript 46.35% CSS 52.59% HTML 0.19%

big-data machine-learning-algorithms network notebook notes system


blog's Issues

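data-hub: performance gains from json -> protobuf

A while ago we migrated the log service from json to protobuf, and performance improved substantially. The charts below show system resource usage once each version was running under its normal load; the request volume was roughly the same for both.

Response time

Conclusion

protobuf carries five times the I/O load of json with about the same response time, while system resource consumption drops by half. For our log collection application that is a solid performance gain.

References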

https://medium.com/@caffeinocode/bye-bye-json-welcome-protocol-buffers-a3e4319ba51
https://developers.google.com/protocol-buffers/docs/proto3
https://www.infoq.cn/article/json-is-5-times-faster-than-protobuf

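ambari upgrade notes

presto optimization in practice

Optimization theory for decision-making in a cpm ad system

Choosing an efficient compression algorithm for big data

References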

https://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO

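Retrospective on a dns resolution problem

Tracking down dns resolution in the offline environment

Data sync works by pulling, and whether each pull succeeded is recorded in the production database.
Occasionally, dns resolution fails.

Analyzing it with a packet capture

First, write a script that opens and closes the data connection in a loop.
Capture packets on the machine running the script.
Use ifconfig to get the interface name; dns queries go out on port 53.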

tcpdump -i enp3s0 -s 0 -w /var/tmp/dns.cap port 53
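The looping script itself is not included in the post. A minimal sketch of one way to write it, assuming the failure can be reproduced by resolving the hostname directly rather than opening a full connection; the function name, the port, and the injectable `resolve` argument are illustrative choices, not the author's code:

```python
import socket
import time


def probe_dns(host, attempts, resolve=socket.getaddrinfo, pause=0.0):
    """Resolve `host` repeatedly and return the attempt numbers that failed."""
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            # Port 3306 is only a placeholder; the lookup is what matters here.
            resolve(host, 3306)
        except socket.gaierror:
            # gaierror is what socket.getaddrinfo raises on a failed lookup.
            failures.append(attempt)
        time.sleep(pause)
    return failures
```

Running something like `probe_dns("db.example.com", attempts=100, pause=0.5)` (a hypothetical host) while tcpdump is capturing gives attempt numbers that can be matched against the records in the capture.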

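Keep capturing until the code fails with a dns resolution error, then stop and load the capture file into wireshark for analysis.
Looking at the data, record 21 shows a failed dns resolution.
Check the server's dns configuration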

/etc/resolv.conf

# Generated by NetworkManager
#search hunliji.cn
nameserver 114.114.114.114
nameserver 223.6.6.6

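It turned out the original config file had a search directive in use.
See http://dns-learning.twnic.net.tw/bind/intro4.html for an explanation. With it, the hunliji.cn suffix was being appended during dns resolution, which is not the domain we actually meant to resolve. Would removing search fix it?
After removing search, resolution works normally.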
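To see why a search line changes which names get queried, here is a simplified model of how a stub resolver builds its candidate list from resolv.conf. It is a sketch under stated assumptions: it follows the usual search/ndots behavior (default ndots of 1) and ignores timeouts, retries, and nameserver rotation; `candidate_names` is a hypothetical helper, not a library function.

```python
def candidate_names(name, search, ndots=1):
    """Return the query names a stub resolver would try, in order (simplified)."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [name] + expanded
    # Too few dots: the search list is tried before the bare name.
    return expanded + [name]
```

With `search hunliji.cn` in place, `candidate_names("db.example.com", ["hunliji.cn"])` yields `db.example.com` followed by `db.example.com.hunliji.cn`, so a failed first lookup can lead to a query carrying the extra suffix. With the search line commented out, the list passed in is empty and only the original name is tried.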

References

elastic-search-sync-data

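Building a search engine

Real-time data sync
Offline data sync

Real-time data sync

The tool used is go-mysql-elasticsearch, with some changes to the original author's code to support simple geo location sync: the latitude and longitude fields are merged into a single field, and an incremental update to the coordinates requires both fields to be present. The overall design idea is that the tool poses as a slave of the mysql master.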
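The patch to go-mysql-elasticsearch is written in Go and is not shown here. The merging rule can still be sketched in a few lines of Python, assuming the target is an elasticsearch geo_point in its object form (`lat`/`lon`). The field names `latitude`, `longitude`, and `location` are hypothetical.

```python
def merge_geo(row, lat_field="latitude", lon_field="longitude", target="location"):
    """Fold separate latitude/longitude columns into one geo_point-style field."""
    # Copy every other column through unchanged.
    doc = {key: value for key, value in row.items() if key not in (lat_field, lon_field)}
    lat, lon = row.get(lat_field), row.get(lon_field)
    # Only emit the merged field when both halves are present, mirroring the
    # requirement that an incremental update carries both coordinates.
    if lat is not None and lon is not None:
        doc[target] = {"lat": float(lat), "lon": float(lon)}
    return doc
```

A row that carries only one of the two coordinates produces a document without the merged field, which is why an update touching just one column cannot refresh the location on its own.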

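Offline data sync

For fields joined from other tables, the tool above cannot yet provide full support, so they are maintained offline on a schedule by python code.

Pitfalls of es remote access

Compression algorithm research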

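sublime development environment

Planning to switch my development ide to sublime 3; some brief notes

Preparation
1.0 mac dmg download link
1.1 Documentation https://www.sublimetext.com/docs/3/
1.2 Install Package Control https://packagecontrol.io/installation#st3
Plugin installation
2.0 SublimeTmpl: code template generation
2.1 GitGutter: shows code changes against version control
2.2 Code autocompletion plugins
- SublimeCodeIntel
- jedi
Terminal alias for opening files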

alias sub='open -a "Sublime Text"'

Opening a terminal from sublime

See wbond/sublime_terminal#89

Shortcut to open the file's directory: SHIFT + COMMAND + T

#!/bin/bash

# Modified following this issue: https://github.com/wbond/sublime_terminal/issues/89

CD_CMD="cd "\\\"$(pwd)\\\"" && clear"
if echo "$SHELL" | grep -E "/fish$" &> /dev/null; then
  CD_CMD="cd "\\\"$(pwd)\\\""; and clear"
fi
VERSION=$(sw_vers -productVersion)
OPEN_IN_TAB=0

while [ "$1" != "" ]; do
    PARAM="$1"
    VALUE="$2"
    case "$PARAM" in
        --open-in-tab)
            OPEN_IN_TAB=1
            ;;
    esac
    shift
done

if (( $(expr $VERSION '<' 10.7) )); then
    RUNNING=$(osascript<<END
    tell application "System Events"
        count(processes whose name is "iTerm")
    end tell
END
)
else
    RUNNING=1
fi

if (( ! $RUNNING )); then
    osascript<<END
    tell application "iTerm"
            tell current window
                tell current session of (create tab with default profile)
                    write text "$CD_CMD"
                end tell
            end tell
            activate
    end tell
END
else
    if (( $OPEN_IN_TAB )); then
        osascript &>/dev/null <<EOF
        tell application "iTerm"
                    if (count of windows) = 0 then
                        set theWindow to (create window with default profile)
                        set theSession to current session of theWindow
                    else
                        set theWindow to current window
                        tell current window
                            set theTab to create tab with default profile
                            set theSession to current session of theTab
                        end tell
                    end if
                    tell theSession
                        write text "$CD_CMD"
                    end tell
                    activate
        end tell
EOF
    else
        osascript &>/dev/null <<EOF
        tell application "iTerm"
                    tell (create window with default profile)
                        tell the current session
                            write text "$CD_CMD"
                        end tell
                    end tell
                    activate
        end tell
EOF
    fi
fi

Convert tab key presses to spaces

    // Preferences.sublime-settings
    // The number of spaces a tab is considered equal to
    "tab_size": 4,

    // Set to true to insert spaces when tab is pressed
    "translate_tabs_to_spaces": true,

Auto-format py code with pep8

https://packagecontrol.io/packages/Python%20PEP8%20Autoformat

OSX: ctrl+shift+r

markdown preview plugin

https://github.com/revolunet/sublimetext-markdown-preview

Live preview https://github.com/revolunet/sublimetext-markdown-preview#live-reload

Installing plugins for js development in sublime 3

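Building a private python package index

A private index for the team's many projects

Shared code is packaged for reuse but is private and should not be published on pypi
Faster installation of commonly used python libraries

Choosing a private index server

See https://wiki.python.org/moin/PyPiImplementations

We chose pypiserver: small and elegant, and still maintained within the past year

Setup

Install the environment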

export PRIVATE_PYPI=xxx
cd $PRIVATE_PYPI
virtualenv pypienv    # create a virtualenv

source $PRIVATE_PYPI/pypienv/bin/activate
pip install pypiserver   # install pypiserver

mkdir $PRIVATE_PYPI/package # create the directory that holds the packages

Write a shell startup script

#!/bin/sh
# run-pypi.sh
# activate the virtualenv
. $PRIVATE_PYPI/pypienv/bin/activate
exec pypi-server -p 3141 $PRIVATE_PYPI/package

Use supervisor to keep the process running

pip install supervisor
echo_supervisord_conf > /etc/supervisord.conf # generate the config file
supervisord # start it

pypi-server

# configure pypi-server
[program:pypi-server]
directory=/home/hadoop
command=sh run-pypi.sh
autostart=true
autorestart=true
startretries=3     ; automatic retries on a failed start, default 3
user=root          ; user to run as
redirect_stderr=true  ; redirect stderr to stdout, default false
stdout_logfile_maxbytes=20MB  ; stdout log file size, default 50MB
stdout_logfile_backups=20     ; number of stdout log backups
; stdout log file; startup fails if the directory does not exist, so create it by hand (supervisord creates the log file itself)
stdout_logfile=/var/www/logs/pypi_stdout.log

# symlink
$ cd /etc/supervisor/conf.d/
$ sudo ln -s $PRIVATE_PYPI/pypi-supervisor.conf pypi-supervisor.conf

start

supervisorctl start pypi-server

Authentication

Uploading a package requires a username and password; the password file is generated with the htpasswd command

sudo yum install httpd-tools # ubuntu apt-get install apache2-utils
htpasswd -sc $PRIVATE_PYPI/.htaccess user

Update run-pypi.sh

exec pypi-server -p 3141 -P $PRIVATE_PYPI/.htaccess $PRIVATE_PYPI/package

Reload supervisor
sudo supervisorctl reload
Configure .pypirc on the build machine

[distutils]
index-servers=privatepypi 

[privatepypi]
repository:url
username:your name
password:your passwd
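When several build machines need the same file, it can be generated rather than hand-edited. A minimal sketch using the standard library's configparser, assuming the section and option names from the file above; the helper name and the example URL are hypothetical, and the post leaves the real repository URL as a placeholder:

```python
import configparser
import io


def render_pypirc(server, repository, username, password):
    """Return .pypirc text with one index server section."""
    # RawConfigParser avoids '%' interpolation, which matters for passwords.
    config = configparser.RawConfigParser()
    config["distutils"] = {"index-servers": server}
    config[server] = {
        "repository": repository,
        "username": username,
        "password": password,
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()
```

Something like `render_pypirc("privatepypi", "http://pypi.example.internal:3141", "user", "secret")` returns text that can be written to `~/.pypirc`. configparser writes `key = value` pairs, which is the same ini syntax as the `key:value` form shown above.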

Upload
python setup.py sdist upload -r privatepypi

Installing from the private index

pip install --extra-index-url path package-name --trusted-host path

See https://github.com/pypiserver/pypiserver#quickstart-installation-and-usage

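cpm advertising in practice

Collected resources

A collection of machine learning and deep learning resources

stanford-tensorflow-tutorials

Operations research

Quantitative trading

hadoop cluster commands

Create an hdfs directory as a specific user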

sudo -u hdfs hadoop fs -mkdir /user/myfile

merge file

hadoop fs -getmerge track_event/16-10-17/event_16-10-17_04_00_00 event_16-10-17_04_00_00.log

distcp

sudo -u hadoop hadoop distcp hdfs://master:9000//user/hadoop/track_event/16-12-02 hdfs://$HOST:8020/flume/track_event/dt=16-12-02

error info

main : run as user is nobody
main : requested yarn user is hdfs
Can't create directory

Change the permissions of the yarn mr directory,
e.g. chmod -R 777 /hadoop/yarn/local

hack

sudo -u hadoop hadoop fs -ls hftp://$HOST:50070/user/hadoop/

hdfs resource skew

Manual balance

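Data mining job description

Perks:

Large platform, plenty of opportunities to exchange ideas, lots of room to learn, year-end bonus

Algorithms

Responsibilities:

1. Using large volumes of business data, analyze and implement product mining, user recommendation, seller analysis, user profiling, and so on;
2. Improve the personalized recommendation, advertising, search, machine learning, risk control, and crawler systems;
3. Communicate broadly with the data product team, provide routine baseline data, and improve data operations efficiency;

Requirements:

Bachelor's degree or above in computer science, mathematics, electronics, or a related field (or proof of equivalent ability);
Familiar with Python and R, with some knowledge of the centos platform;
Able to abstract business problems into models with tensorflow, mxnet, or xgboost and solve real problems;
Solid command of basic algorithms and data structures, and familiar with the theory behind data mining, machine learning, and parallel computing;

Bonus points:

Experience building and tuning models in risk control, recommendation, audience profiling, or similar areas is a plus;
Experience at a large internet company, a github profile, or a personal website is a plus;
Papers published at top conferences or in top journals, or a ranked result in data mining competitions;

Data engineering

Position description:

Responsibilities:

Take part in optimizing and designing 婚礼纪's big data architecture and in improving the business systems, including the personalized recommendation, advertising, search, machine learning, risk control, and crawler systems;
Communicate broadly with the data product team, provide routine baseline data, and improve data operations efficiency;

Requirements:

Bachelor's degree or above in computer science, mathematics, electronics, or a related field (or proof of equivalent ability);
Familiar with at least two of Python, Go, Scala, Shell, and Java; expert on the centos platform;
Familiar with performance analysis of networking, IO, and the like, with strong system optimization skills;
Strong data sense and logical analysis skills;

Bonus points:

Experience with big data processing on Spark or Flink is a plus;
Experience at a large internet company, a github profile, or a personal website is a plus;

Personal qualities:

Ability to learn
Communication skills
Strong soft skills

Benchmark talent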

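婚礼纪 (Hunliji) development

History

March 2013: 婚礼纪 1.0 officially launched;

July 2013: raised an angel round of several million RMB from 青松基金;

November 2013: 婚礼纪 2.0 released, adding a merchant module and becoming a wedding e-commerce service platform;

August 2014: raised a Series A of several million USD from 祥峰投资 (Vertex);

September 2014: launched the "新娘说" community feature;

December 2014: payment system completed, with support for online transactions;

April 2015: raised a Series B in the tens of millions of USD, led by 经纬创投;

November 2015: launched the wedding goods e-commerce section;

March 2016: raised a B+ round led by 复星昆仲, with 经纬 and 祥峰 following on again; the two rounds together totaled USD 30 million;

July 2016: released "云蝌", the wedding industry's first SaaS system (since upgraded to "海草云");

August 2016: added a finance entry point, supporting the "新婚贷" installment payment product;

October 2016: opened the 婚礼纪 experience center in Hangzhou, the wedding industry's first new-format experience store backed by a big data platform;

March 2018: closed a USD 65 million Series C led by 兰馨亚洲, with 经纬** and 复星集团 following, and planned to launch an RMB 200 million industry investment fund to build out the upstream and downstream supply chain and, through deep big-data integration, create effective industry synergy;

June 2018: registered users passed 40 million;

June 2018: raised a C1 round led by 上合资本, with existing shareholders 兰馨亚洲, 经纬**, and 复星锐正 following; cumulative funding exceeded USD 100 million;

July 2018: hosted the first 金犀奖 global wedding industry trends summit; the newly created "金犀奖" was described by the industry and the media as the most important merchant rating guide, the wedding world's Michelin Red Guide.

Funding converted to RMB

Contact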

lu_fei#hunliji.com

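ubuntu 18.04 experience

Configuring application icons on ubuntu

On ubuntu, a lot of software ships as a tar archive that works as soon as it is unpacked. But it has no icon and cannot be pinned to the dash.
For example, to add a launcher for goland, you can do this:
sudo gedit /usr/share/applications/goland.desktop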

#!/usr/bin/env xdg-open
[Desktop Entry]
Encoding=UTF-8
Name=GoLand
Comment=go lang develop tool
## software location
Exec=/home/m2/software/GoLand-2018.2.1/bin/goland
Icon=/home/m2/software/GoLand-2018.2.1/bin/goland.png
Terminal=false
StartupNotify=true
Type=Application
Categories=Application;Development;
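The same entry is needed for every unpacked tarball, so it can be templated. A small sketch, assuming only the keys used in the file above; `render_desktop_entry` is a hypothetical helper, and it keeps just the registered `Development` category:

```python
def render_desktop_entry(name, exec_path, icon_path, comment=""):
    """Return the text of a minimal .desktop launcher file."""
    lines = [
        "[Desktop Entry]",
        "Type=Application",
        f"Name={name}",
        f"Comment={comment}",
        f"Exec={exec_path}",
        f"Icon={icon_path}",
        "Terminal=false",
        "StartupNotify=true",
        "Categories=Development;",
    ]
    return "\n".join(lines) + "\n"
```

The returned text would be saved under `/usr/share/applications/` (or `~/.local/share/applications/` for a single user) with a `.desktop` extension, as in the goland example.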

Proxy access (科学上网)

Software installation
- [apt-get install]
  sudo add-apt-repository ppa:hzwhuang/ss-qt5
  sudo apt-get update
  sudo apt-get install shadowsocks-qt5
- tar.gz install
Browser plugin

copy-paste-cli

ubuntu shell copy
"Select = copy, middle click = paste". "Ctrl+Shift+C = copy, Ctrl+Shift+V = paste"

Configuring an open command for the terminal

sudo apt-get install libgnome2-bin
gnome-open file

https://www.vmware.com/products/workstation-pro/workstation-pro-evaluation.html

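References

A study of an NTP amplification DDoS attack sample

Research on traffic balancing in e-commerce sites

web API performance optimization

Measuring qps

Tool: http_load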

make; make install

http_load 
usage:  http_load [-checksum] [-throttle] [-proxy host:port] [-verbose] [-timeout secs] [-sip sip_file]
            -parallel N | -rate N [-jitter]
            -fetches N | -seconds N
            url_file
One start specifier, either -parallel or -rate, is required.
One end specifier, either -fetches or -seconds, is required.

Test

 http_load -parallel 5 -fetches 1000 url.txt
1000 fetches, 5 max parallel, 53000 bytes, in 3.66203 seconds
53 mean bytes/connection
273.072 fetches/sec, 14472.8 bytes/sec
msecs/connect: 0.053083 mean, 0.316 max, 0.017 min
msecs/first-response: 18.2421 mean, 70.557 max, 7.095 min
HTTP response codes:
  code 200 -- 1000

Once I got to the business logic, performance was dismal

1000 fetches, 5 max parallel, 1.46075e+06 bytes, in 801.51 seconds
1460.75 mean bytes/connection
1.24764 fetches/sec, 1822.5 bytes/sec
msecs/connect: 0.056571 mean, 0.306 max, 0.01 min
msecs/first-response: 3380.38 mean, 6363.63 max, 739.752 min
11 timeouts
11 bad byte counts
HTTP response codes:
  code 200 -- 989
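To put the two runs side by side, the fetch rate can be recomputed from the first line of each http_load report. A small sketch; the regular expression is written against the summary lines shown in this post and may not cover every http_load output variant:

```python
import re

# Matches e.g. "1000 fetches, 5 max parallel, 53000 bytes, in 3.66203 seconds"
SUMMARY = re.compile(r"(\d+) fetches, .* in ([\d.]+) seconds")


def fetch_rate(summary_line):
    """Return fetches per second computed from an http_load summary line."""
    match = SUMMARY.search(summary_line)
    if match is None:
        raise ValueError("not an http_load summary line")
    fetches, seconds = int(match.group(1)), float(match.group(2))
    return fetches / seconds


baseline = fetch_rate("1000 fetches, 5 max parallel, 53000 bytes, in 3.66203 seconds")
business = fetch_rate("1000 fetches, 5 max parallel, 1.46075e+06 bytes, in 801.51 seconds")
slowdown = baseline / business
```

The recomputed rates agree with the `fetches/sec` figures http_load prints (273.072 and 1.24764), and the ratio between them is roughly 219, which is the gap the rest of this section tries to explain.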

System behavior: the cpu is maxed out immediately

# top on the web service
25701 m2        20   0 1395112 191568  29836 S  99.3  0.6  13:59.36 python

Analyze the business logic code and find the slow part

import pandas as pd
...
    # data = pd.DataFrame()
    # for category, group in svm_data.groupby('category'):
    #     group_size = int(category_num_map.get(category, 0))
    #     if not group_size:
    #         continue
    #     group['group'] = map(lambda x: x / group_size, range(group.shape[0]))
    #     group['order'] = category_order_map.get(category, 99999)
    #     data = data.append(group)

The landmine

There was a landmine buried here: data = data.append(group)
Under the hood, DataFrame.append calls this function

https://github.com/pandas-dev/pandas/blob/52cffa3b3b2a510c30ed7f8cc8525c03d62e9130/pandas/core/reshape/concat.py#L216

From the official documentation

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html#pandas-dataframe-append

Notes

If a list of dict/series is passed and the keys are all contained in the DataFrame’s index, the order of the columns in the resulting DataFrame will be unchanged.

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

DataFrame.append returns a new object, so every call makes a copy
Collecting the groups in a list instead gave a clear performance improvement
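The post does not show the rewritten loop. A sketch of the list-based version, assuming the same variable names as the commented-out code above and Python 3 integer division; the wrapper function `assign_groups` is added here only to make the example self-contained:

```python
import pandas as pd


def assign_groups(svm_data, category_num_map, category_order_map):
    """Tag each category's rows with a group index and an order, copying once."""
    parts = []
    for category, group in svm_data.groupby("category"):
        group_size = int(category_num_map.get(category, 0))
        if not group_size:
            continue
        group = group.copy()  # avoid writing into a slice of svm_data
        group["group"] = [i // group_size for i in range(len(group))]
        group["order"] = category_order_map.get(category, 99999)
        parts.append(group)  # cheap list append, no DataFrame copy
    if not parts:
        return pd.DataFrame()
    # A single concatenation replaces the repeated DataFrame.append calls.
    return pd.concat(parts)
```

Appending to a Python list costs almost nothing per iteration, and `pd.concat` copies the rows once at the end instead of recopying the growing frame on every pass. This is also the only option on current pandas, since `DataFrame.append` was deprecated in 1.4 and removed in 2.0.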

References

https://coolshell.cn/articles/7490.html

es document similarity algorithms for search

https://www.elastic.co/guide/cn/elasticsearch/guide/current/pluggable-similarites.html

Research on data sync between geographically separate data centers

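Troubleshooting a redis incident

In the traffic control design, redis caches the real-time traffic values. One key in that cache holds a set with 30,000+ members, each stored as a string of mostly 6 to 7 characters, so that single key-value pair is over 200 KB.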
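A quick back-of-the-envelope check of that size, as a sketch. The wire-size part assumes the whole set is returned in one reply (for example via SMEMBERS), which the post does not state, and it counts only standard RESP framing; the function names are illustrative.

```python
def raw_bytes(members, member_len):
    """Payload bytes of the member strings alone."""
    return members * member_len


def resp_bulk_len(payload_len):
    """Bytes of one RESP bulk string: '$', length digits, CRLF, data, CRLF."""
    return 1 + len(str(payload_len)) + 2 + payload_len + 2


def set_reply_bytes(members, member_len):
    """Bytes of a RESP array reply holding every member of the set."""
    header = len(f"*{members}\r\n")
    return header + members * resp_bulk_len(member_len)


payload = raw_bytes(30_000, 7)           # 210000 bytes of member data
on_the_wire = set_reply_bytes(30_000, 7)  # 390008 bytes including framing
```

With 30,000 seven-character members the data alone is about 210 KB, consistent with the "over 200 KB" estimate, and a full read of the set moves close to 390 KB per call once protocol framing is included.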

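Monitoring charts

The first half of both the cpu usage and network usage charts is high, and cpu usage shows spikes.

Faced with this, I recalled that oversized redis values have a considerable impact on redis qps, with a performance inflection point at a payload size of around 1 KB.

Solution

High cpu and network metrics
The traffic limits are batch-updated every 10 minutes, and the updated data is kept current in redis. The public-facing API now also caches the traffic limit in a local variable (an in-process cache) on a timer before filtering, so the limit only has to be fetched from redis once every 10 minutes.
cpu spikes
The spikes come at fairly regular intervals (one pulse every 1 to 2 minutes), so I guessed a timer was involved and started from the redis log, as follows: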

# redis log file
 sudo tail -f /var/log/redis/redis.log
18013:C 04 Jan 16:41:36.609 * RDB: 905 MB of memory used by copy-on-write
16206:M 04 Jan 16:41:36.630 * Background saving terminated with success
16206:M 04 Jan 16:42:37.018 * 10000 changes in 60 seconds. Saving...
16206:M 04 Jan 16:42:37.035 * Background saving started by pid 18074
18074:C 04 Jan 16:42:50.182 * DB saved on disk
18074:C 04 Jan 16:42:50.191 * RDB: 918 MB of memory used by copy-on-write
16206:M 04 Jan 16:42:50.289 * Background saving terminated with success
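The "10000 changes in 60 seconds" line corresponds to an RDB save point. A simplified model of how save points trigger a background save, assuming the server ran with the `save 900 1`, `save 300 10`, `save 60 10000` defaults shipped by older redis versions (the post does not show the actual redis.conf); `should_snapshot` is a hypothetical name:

```python
# (seconds, changes) pairs, as in "save 900 1", "save 300 10", "save 60 10000".
DEFAULT_SAVE_RULES = [(900, 1), (300, 10), (60, 10000)]


def should_snapshot(seconds_since_last_save, changes, rules=DEFAULT_SAVE_RULES):
    """True when any save point is satisfied (simplified model of the check)."""
    return any(
        seconds_since_last_save > seconds and changes >= min_changes
        for seconds, min_changes in rules
    )
```

Under this model a write-heavy instance that accumulates 10,000 changes within a minute forks a background save about once every 60 seconds, which fits the one-to-two-minute rhythm of the spikes and the roughly 61-second gap between the saves in the log.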

The log above essentially pins it on the cpu cost of writing data to disk, which is tied to redis's default configuration:

sudo vi /etc/redis.conf
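For the first half of the fix (reading the traffic limit from redis only once per 10 minutes), the in-process cache can be as small as the sketch below. This is an illustration rather than the author's code: `CachedValue` and the `loader` callable are hypothetical, and the class is not thread-safe.

```python
import time


class CachedValue:
    """Keep a loaded value in memory and reload it after ttl_seconds."""

    def __init__(self, loader, ttl_seconds=600, clock=time.monotonic):
        self._loader = loader        # e.g. a function that reads the set from redis
        self._ttl = ttl_seconds      # 600 s matches the 10-minute update cycle
        self._clock = clock          # injectable so the expiry logic can be tested
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache is empty or stale: hit the backing store once and remember it.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value
```

Each API worker would build one `CachedValue` around its redis read and call `get()` on every request; only the first call in each 10-minute window reaches redis, so the large set is no longer transferred per request.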

References

https://ningyu1.github.io/site/post/32-redis-aof/
https://www.cnblogs.com/mindwind/p/5067905.html
