recsys_spark

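ItemCF, UserCF, and Swing implemented with Spark SQL: the collaborative-filtering (CF) recall module of a recommendation system.

Data format

Item transaction data with three fields: user ID, item ID, and transaction date (userid, itemid, date). Blacklisted users and items are filtered out and take no part in the computation.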

date userid itemid
2019-05-09 1901140040225006 103943
2019-05-09 1806041288325006 56610
2019-05-09 1812060050236636 16368
2019-05-09 1812060050261006 101562
2019-05-09 1901160070407006 79874
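
As an illustration of the input step described above, here is a minimal sketch in plain Scala (no Spark) that parses lines of this shape and drops blacklisted users and items. The names `Transaction`, `parse`, `filterBlacklisted`, and the two blacklist sets are hypothetical and are not taken from the repository.

```scala
object TransactionData {
  // One purchase record; fields follow the (date, userid, itemid) columns above.
  final case class Transaction(date: String, userId: Long, itemId: Long)

  // Parse a whitespace-separated "date userid itemid" line.
  // Malformed lines (including the header row) yield None.
  def parse(line: String): Option[Transaction] =
    line.trim.split("\\s+") match {
      case Array(date, user, item) =>
        for {
          u <- user.toLongOption
          i <- item.toLongOption
        } yield Transaction(date, u, i)
      case _ => None
    }

  // Drop records whose user or item is blacklisted.
  // Both blacklists are assumed inputs; the source does not say where they come from.
  def filterBlacklisted(
      rows: Seq[Transaction],
      userBlacklist: Set[Long],
      itemBlacklist: Set[Long]
  ): Seq[Transaction] =
    rows.filterNot(t => userBlacklist(t.userId) || itemBlacklist(t.itemId))
}
```

In a Spark job the same filter would more likely be an anti-join against blacklist tables; the in-memory version only shows the intent.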

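ItemCF (item-based collaborative filtering)

An i2i2u algorithm: the items a user has already purchased act as the bridge between the user and other items. Similarity is based on item co-occurrence, with a penalty on the long purchase sequences of highly active users. The similarity formula: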

(formula image: ItemCF similarity)
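
The formula was an image and its exact form is not recoverable from this page. As an assumption only, a common form of ItemCF with an active-user penalty that fits the description above, and the N(u) notation used in the UserCF section, is:

```latex
w_{ij} = \frac{\displaystyle\sum_{u \in N(i) \cap N(j)} \frac{1}{\log\left(1 + |N(u)|\right)}}{\sqrt{|N(i)|\,|N(j)|}}
```

Here N(i) is the set of users who bought item i and N(u) is the set of items bought by user u. The normalization actually used by the repository may differ.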

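UserCF (user-based collaborative filtering)

A u2u2i algorithm: users act as the bridge between other users and items. Similarity is based on user co-occurrence, with a penalty on the long user sequences of popular items. The similarity formula is obtained from the ItemCF formula by replacing i, j (item 1, item 2) in the numerator and denominator with u, v (user 1, user 2), and replacing the user's sequence N(u) with the item's sequence N(i).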

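Swing (graph-based collaborative filtering)

An i2i2u algorithm: the items a user has already purchased act as the bridge between the user and other items. To measure the similarity of items i and j, consider the users u and v who both purchased i and j. The fewer items these two users have purchased in common, the higher the similarity of i and j. The similarity formula: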

(formula image: Swing similarity)
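
The description above matches the usual statement of Swing: sim(i, j) is the sum, over pairs of users u, v who both bought i and j, of 1 / (alpha + |I_u ∩ I_v|), where I_u is the set of items bought by u. The following in-memory sketch assumes that form, counts each unordered user pair once, and treats `alpha` as a hypothetical smoothing parameter. It is not the repository's Spark SQL implementation.

```scala
object SwingSketch {
  // purchases: user -> set of items that user bought.
  // Returns sim(i, j) for each unordered item pair (i < j) that at least
  // two users have both bought.
  def similarities(
      purchases: Map[Long, Set[Long]],
      alpha: Double
  ): Map[(Long, Long), Double] = {
    val users = purchases.keys.toVector.sorted
    val contributions = for {
      a <- users.indices
      b <- (a + 1) until users.length
      common = purchases(users(a)) intersect purchases(users(b))
      if common.size >= 2
      // The more items two users share, the less each shared pair counts.
      weight = 1.0 / (alpha + common.size)
      items = common.toVector.sorted
      x <- items.indices
      y <- (x + 1) until items.length
    } yield ((items(x), items(y)), weight)
    // Sum the contributions of all user pairs for each item pair.
    contributions.groupMapReduce(_._1)(_._2)(_ + _)
  }
}
```

Enumerating every user pair is quadratic in the number of users, so this shape only suits small data; a distributed version has to bound the per-item user lists.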

Computing similar items

ItemCF results stored in HBase, queried through Phoenix:

item => [[item1, score],[item2, score]...]

spu recommend
00017_201209 [[201210,0.07535],[221502,0.03041],[215272,0.01753],[212219,0.01753],[228212,0.01688]
00042_103060 [[61212,0.03611],[10525,0.02616],[101486,0.03138],[91764,0.01898],[95527,0.02186],[661
0006d_25593 [[6598,0.00319],[11129,0.00762],[178,0.00696],[8558,0.0041],[11398,0.0029],[25536,0.012
00077_35837 [[25518,0.01044],[36420,0.41703],[36357,0.15762],[83810,0.02686],[103838,0.02686],[1038
0007c_9700 [[227970,0.03401],[219462,0.02626],[219401,0.02626],[223635,0.02247],[223641,0.02247],[2
000cb_33363 [[8572,0.00877],[19665,0.00756],[12812,0.01092],[11853,0.0094],[8528,0.01173],[1705,0.0
000d0_50738 [[119582,0.03503],[100296,0.02922],[97248,0.02309],[72044,0.02153],[79245,0.02023],[119
000d5_68111 [[50729,0.00632],[67871,0.02315],[68081,0.01277],[9624,0.01253],[57234,0.00996],[67983,
000dd_45311 [[3721,0.02095],[21908,0.0156],[25633,0.01145],[5002,0.01438],[28633,0.02605],[17088,0.

Computing recommendations

user => [item1, item2, item3...]

userid recommend
00000_180731 [50648,14253,211049,14255,209517,112985,48507,13458,206846,35472,18769,97610,78105,21
00003_532933 [203038,78262,81480,120623,203040,81447,100994,203009,101491,81457,114550,55115,80139
00007_552871 [105023,10199,100894,100565,99769,96980,30781,115965,230960,95059,11129,104702,51831,6
0000b_194813 [231082,60365,101950,57700,209504,113725,101939,5906,94771,59979,237823,102324,229264
0000e_398677 [210020,210019,74081,91787,48428,90769,17449,91800,91822,17448,91823,91803,17437,1162
0000e_590120 [106907,72369,94907,74972,79603,97245,202614,97243,207393,229353,74063,78596,210969,11
00010_180604 [73633,24509,24507,7481,101877,107612,116350,100115,34379,229431,113725,229618,236254,
00011_536634 [209481,210381,112120,234451,113968,119215,64699,121035,106867,121057,103750,48503,12,
00013_180604 [212154,212156,212157,17141,62421,69801,232732,62407,211132,211029,37857,215047,8741,6
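
One plausible way to go from the similar-item table to a per-user list is sketched below: sum the similarity scores of every candidate over the user's purchased items, drop items the user already has, and keep the top N. The function name, the scoring rule (a plain sum), and the tie-break are assumptions; the repository's actual scoring may weight or filter differently.

```scala
object RecommendSketch {
  // history: items the user already bought.
  // similar: item -> list of (similar item, score), the shape of the ItemCF table.
  // Returns up to topN unseen item ids, highest accumulated score first.
  def recommend(
      history: Set[Long],
      similar: Map[Long, Seq[(Long, Double)]],
      topN: Int
  ): Seq[Long] = {
    val candidates = for {
      bought <- history.toSeq
      (item, score) <- similar.getOrElse(bought, Seq.empty)
      if !history.contains(item)
    } yield (item, score)

    candidates
      .groupMapReduce(_._1)(_._2)(_ + _) // sum scores per candidate item
      .toSeq
      .sortWith { case ((i1, s1), (i2, s2)) =>
        s1 > s2 || (s1 == s2 && i1 < i2) // ties broken by smaller item id
      }
      .take(topN)
      .map(_._1)
  }
}
```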

recsys_spark's People

Contributors

xiaogp

recsys_spark's Issues

About UserCF

In the u2u2i step, I do not see where similar users are looked up. Put differently, in this line:
    val df_user_prefer2 = df_user_prefer1.withColumn("score", col("pref") * col("similar") * (lit(1) / log(col("sum_item") * hot_item_regular + math.E))).select("useridJ", "itemid", "score")
why can the useridI column simply be dropped?

UserCF

    // User co-occurrence matrix
    val item_user2 = item_user.flatMap { row =>
      val userlist = row.getAs[scala.collection.mutable.WrappedArray[Long]](1).toArray.sorted
      val result = new ArrayBuffer[(Long, Long, Double)]()
      for (i <- 0 to userlist.length - 2) {
        for (j <- i + 1 to userlist.length - 1) {
          result += ((userlist(i), userlist(j), 1.0 / math.log(1 + userlist.length))) // penalty for popular items
        }
      }
      result
    }.withColumnRenamed("_1", "useridI").withColumnRenamed("_2", "useridJ").withColumnRenamed("_3", "score")

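This code runs out of memory (OOM) on large datasets, because a single item can have millions of users who interacted with it. How should it be optimized?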
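
The nested loop emits n(n-1)/2 pairs for an item with n users, so one item with a million users produces roughly 5 × 10^11 tuples. A common mitigation, offered here as a suggestion and not as something the repository is confirmed to do, is to cap each item's user list by random sampling before generating pairs. `maxUsers` is a hypothetical tuning parameter.

```scala
import scala.util.Random

object CoOccurrenceCap {
  // Number of unordered pairs the double loop produces for n users.
  def pairCount(n: Long): Long = n * (n - 1) / 2

  // Keep at most maxUsers users of one item, chosen uniformly at random.
  // Lists already within the cap are returned unchanged.
  def capUsers(users: Seq[Long], maxUsers: Int, rng: Random): Seq[Long] =
    if (users.length <= maxUsers) users
    else rng.shuffle(users).take(maxUsers)
}
```

With a cap of 1,000, the worst-case item produces 499,500 pairs instead of hundreds of billions. The popularity penalty can still be computed from the original list length, so very popular items remain penalized after sampling.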

Using Swing

The Swing implementation clearly performs many joins. Can its efficiency be guaranteed? The flatMap also introduces serious data skew.
