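I have a dataset and I am trying to compute approxQuantile on it. It works when I filter down to a single sample group, but I can't get it to work with a window function.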
+------------+--------------------------------------+---------------------+
|original |variation |similarity_score |
+------------+--------------------------------------+---------------------+
|pencils |pencils |0.982669767327994 |
|pencils |pencils ticonderoga |0.8875609147113148 |
|pencils |pencils bulk |0.5536876549778099 |
|pencils |pencils for kids |0.39102614317876977 |
|pencils |mechanical pencils |0.3837525124443511 |
|pencils |school supplies |0.36800207529412093 |
|pencils |pencils mechanical |0.20423055289450207 |
|pencils |black pencils |0.08241323295822053 |
|pencils |erasers |0.08101804016695552 |
|pencils |papermate pencils |0.08091683972455012 |
|pencils |pensils for kids |0.07299289422964994 |
|pencils |loose leaf paper |0.07113136587149338 |
|pencils |pencil sharpener |0.0684130629091813 |
|pencils |pencils with sayings |0.06350472916694737 |
|pencils |cute pencils |0.058316002229988215 |
|pencils |colored pencils |0.05552992878175486 |
|pencils |pensils sharpener |0.0491689934641725 |
|pencils |pens |0.048082618970934014 |
|pencils |mechanical pencils 0.5 |0.04730075285308284 |
|pencils |pencil box |0.04537707727651545 |
|pencils |pencil case |0.04408623373105654 |
|pencils |pensils 0.5 |0.042054450870644036 |
|pencils |crayons |0.03968021119078101 |
|pencils |cool pencils |0.03952088726420226 |
|pencils |pen |0.037752562111173546 |
|pencils |cool pensils |0.037510831184747497 |
|pencils |glue sticks |0.032787653384488386 |
|pencils |drawing pencils |0.032398405257143804 |
|pencils |cute pensils |0.031057982991620214 |
|pencils |cool penciles |0.02964868546657146 |
|pencils |art pencils |0.027646328736291175 |
|pencils |index cards |0.02744109743018355 |
|pencils |folders |0.027070809949858353 |
|pencils |pensil cases |0.0245633981485688 |
|pencils |golf pencils |0.02411058428627058 |
|pencils |notebooks |0.023534237276835582 |
|pencils |binder |0.02262728927378412 |
|pencils |pens bulk |0.021022196010267332 |
|pencils |folders with pockets |0.020883099832185503 |
|pencils |amazon basics pens |0.02010866875890696 |
|pencils |notebook |0.02002518175509854 |
|pencils |pencil grips |0.01712161261500511 |
|pencils |scissors |0.014319987175720715 |
|pencils |paper |0.013844962316376306 |
|pencils |tissues |0.012853953323240916 |
|pencils |sketch book |0.011383006178195793 |
|pencils |colored pens |0.010608837622836953 |
|pencils |printer paper |0.007159662743211316 |
|pencils |pencil holder |0.005219375924877335 |
|pencils |copy paper |0.0048398059676751275|
|pencils |paper clips |0.004323946307993657 |
|pencils |backpack |0.002456516357354455 |
|pencils |amazonbasics |0.002386583749856832 |
|pencils |desk organizer |0.0014871301773208652|
|pencils |fun penciles |7.114920197678272E-4 |
|pencils |pensils crayola |5.863935694892475E-4 |
|pencils |paper towels |4.956030568975998E-4 |
|pencils |pensils usa |3.074806676441355E-4 |
|pencils |golf pensils |1.6343203531543173E-4|
|pencils |carpenter penciles |1.537148743666664E-4 |
|pencils |usb extension cable |9.276356621610018E-5 |
|baby monitor|baby monitor m6s |0.9497291269175191 |
|baby monitor|baby monitor |0.9449706931157679 |
|baby monitor|baby monitor audio |0.9194544999991849 |
|baby monitor|baby monitor m7 |0.6828123503685779 |
|baby monitor|video baby monitor |0.6587247331182673 |
|baby monitor|motorola baby monitor |0.4112370152461352 |
|baby monitor|wifi baby monitor |0.32615298083708877 |
|baby monitor|baby monitor with camera |0.32488468999022607 |
|baby monitor|ibaby monitor m2 |0.22209678642286393 |
|baby monitor|babybmonitor |0.22201449474747625 |
|baby monitor|baby monitor motorola |0.15539204748659272 |
|baby monitor|ibaby monitor wall mount |0.15534317135388104 |
|baby monitor|baby monitor wifi |0.1452555652119229 |
|baby monitor|infant optics dxr-8 video baby monitor|0.1254649582976917 |
|baby monitor|babymonitor video |0.07508481508302599 |
|baby monitor|baby ping |0.07479584563204815 |
|baby monitor|summer baby monitor |0.06805384082762049 |
|baby monitor|arlo baby monitor |0.06026686625908922 |
|baby monitor|summer infant monitor |0.05179703845795807 |
|baby monitor|baby monitor |0.05132347606161915 |
|baby monitor|nanit baby monitor |0.05098268103975827 |
|baby monitor|owlet baby monitor |0.049905204987139684 |
|baby monitor|baby crib |0.04241570098880047 |
|baby monitor|cocoon cam |0.042065782030944146 |
|baby monitor|angelcare baby monitor |0.03879796480510425 |
|baby monitor|nest camera indoor |0.03752577054574136 |
|baby monitor|hello baby monitor |0.036520838537481 |
|baby monitor|babysense video monitor |0.03601327242576456 |
|baby monitor|lollipop monitor |0.03546663357189976 |
|baby monitor|baby |0.03437681615854988 |
|baby monitor|baby breathing monitor |0.029493224000661625 |
|baby monitor|baby doppler |0.02356697418340991 |
|baby monitor|baby gate |0.021958869140002453 |
|baby monitor|baby heartbeat monitor |0.014386588623583382 |
|baby monitor|arlo |0.013883004625408599 |
|baby monitor|security camera |0.013712142985200558 |
|baby monitor|arlo monitor |0.01047770092532903 |
|baby monitor|wyze cam |0.00661600182771142 |
|baby monitor|eufy babymonitor |0.004696950719180201 |
|baby monitor|yi home camera |0.004032626118694353 |
|baby monitor|monitor |0.0028295679527875124|
|baby monitor|baby registry |0.0027895987025737122|
|baby monitor|diaper genie |0.0026936137516863717|
|baby monitor|crib mattress |0.0022388567663420476|
+------------+--------------------------------------+---------------------+
When I try this:
val quantiles = df.filter($"original" === "baby monitor")
  .stat.approxQuantile("similarity_score", Array(0.20, 0.65), 0.0)
val Q1 = quantiles(0)
val Q3 = quantiles(1)
val IQR = Q3 - Q1
val lowerRange = Q1 - 1.5*IQR
val upperRange = Q3+ 1.5*IQR
val outliers = df.filter($"original" === "baby monitor").filter(s"similarity_score < $lowerRange or similarity_score > $upperRange")
This gives:
+------------+------------------------+-------------------+
|original |variation |similarity_score |
+------------+------------------------+-------------------+
|baby monitor|baby monitor m6s |0.9497291269175191 |
|baby monitor|baby monitor |0.9449706931157679 |
|baby monitor|baby monitor audio |0.9194544999991849 |
|baby monitor|baby monitor m7 |0.6828123503685779 |
|baby monitor|video baby monitor |0.6587247331182673 |
|baby monitor|motorola baby monitor |0.4112370152461352 |
|baby monitor|wifi baby monitor |0.32615298083708877|
|baby monitor|baby monitor with camera|0.32488468999022607|
|baby monitor|ibaby monitor m2 |0.22209678642286393|
|baby monitor|babybmonitor |0.22201449474747625|
+------------+------------------------+-------------------+
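How can I do this with a window function, so that the same operation is applied to the whole dataset, grouped by the original column?
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy("original")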
I'm not sure how to do this idiomatically in Spark, but if running multiple computations is not a problem, you can do:
df.select("original").distinct().collect().map { value =>
// your code here
}.reduce(_ union _)
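If one job per distinct value is too slow, a possible alternative is to compute the quantiles per group in a single aggregation and join the bounds back. The following is only a sketch under some assumptions: it swaps approxQuantile for the Spark SQL aggregate percentile_approx (called through expr), so the quantile values may differ slightly from what approxQuantile with a relative error of 0.0 returns. The object and method names (GroupedOutliers, outliersPerGroup) and the intermediate column names are made up for illustration; the 0.20 and 0.65 probabilities and the 1.5 * IQR rule are taken from the question.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, expr, lit}

// Hypothetical helper, not part of Spark: name and structure are assumptions.
object GroupedOutliers {

  // Returns the rows whose similarity_score lies outside
  // [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR], with Q1/Q3 computed per `original`.
  def outliersPerGroup(df: DataFrame): DataFrame = {
    // One row per `original` with both approximate quantiles in an array column.
    val bounds = df
      .groupBy("original")
      .agg(expr("percentile_approx(similarity_score, array(0.20, 0.65))").as("q"))
      .select(
        col("original"),
        col("q").getItem(0).as("q1"),
        col("q").getItem(1).as("q3"))
      .withColumn("iqr", col("q3") - col("q1"))
      .withColumn("lowerRange", col("q1") - lit(1.5) * col("iqr"))
      .withColumn("upperRange", col("q3") + lit(1.5) * col("iqr"))

    // Attach the per-group bounds to every row, then keep only the outliers.
    df.join(bounds, Seq("original"))
      .filter(col("similarity_score") < col("lowerRange") ||
              col("similarity_score") > col("upperRange"))
      .select("original", "variation", "similarity_score")
  }
}
```

Calling GroupedOutliers.outliersPerGroup(df) should then return the outliers for every original value at once, instead of one filtered DataFrame per value.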