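I am trying to improve the relevance scores of my search results. I have a number of candidate profiles, and I search for the best ones based on their skills and the roles they have held in industry.
I have written a rank profile and use it to find the most relevant candidates. I combine lexical and semantic matching.
The challenge is that the relevance scores Vespa produces are not very good, and I would like to fine-tune the ranking and the relevance scores.
Any tips would be greatly appreciated!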
I want to:
a. Improve the relevance score for this profile.
b. Understand why bm25(skills) is 0.0 in both matchfeatures and summaryfeatures, even though the profile actually has both java and python.
Output:
{
"root": {
"id": "toplevel",
"relevance": 1,
"fields": {
"totalCount": 143
},
"coverage": {
"coverage": 100,
"documents": 143,
"full": true,
"nodes": 1,
"results": 1,
"resultsFull": 1
},
"children": [
{
"id": "id:candidate_profile:candidate_profile::a866fa7f-7e13-48fe-bdca-5a60a3198fd9",
"relevance": 0.01639344262295082,
"source": "candidate_profile",
"fields": {
"matchfeatures": {
"bm25(profile_summary)": 5.470910610067547,
"bm25(skills)": 0,
"firstPhase": 0.8789145673605757,
"nativeRank(profile_summary)": 0.08308099301928237,
"semantic": 0.8789145673605757
},
"skills": [
"HTML",
"CSS",
"Java Script",
"React Js",
"Python",
"Web Designing",
"Leadership",
"Teamwork",
"Observation",
"Time management",
"Communication",
"Avid fitness enthusiast",
"Volunteering",
"Sports",
"English",
"Hindi"
],
"summaryfeatures": {
"bm25(latest_industry)": 0,
"bm25(latest_job_title)": 0,
"bm25(latest_role)": 0,
"bm25(profile_summary)": 5.470910610067547,
"bm25(skills)": 0,
"embedding_sum": 55.06214759836439,
"latest_industry_sum": 40.86598728704121,
"latest_role_sum": 0,
"skill_sum": 52.88688380786334,
"vespa.summaryFeatures.cached": 0
}
}
}
]
}
}
The query I am running in Vespa:
"yql" : " select * from candidate_profile WHERE userQuery() or (all_role_title matches 'Software Developer') AND (skills matches 'python' OR skills matches 'java') AND (latest_role_title matches 'Senior Developer') or ({scoreThreshold:0.032 ,targetHits: 4}nearestNeighbor(embedding, e))",
"input.query(e)" : 'embed(e5, "query: Candidate who is working as Software Developer, Senior Developer has the following skills python, java.")',
"query": " Candidate who is working as Software Developer, Senior Developer has the following skills python, java.",
"ranking" : "common"
The rank profile I created:
rank-profile common {
    weight skills : 500
    weight latest_role : 500
    weight latest_industry : 500
    weight latest_job_title : 400
    inputs {
        query(e) tensor<float>(x[384])
    }
    function semantic() {
        expression: max(0, cos(distance(field, embedding)))
    }
    function semantic_skills() {
        expression: max(0, cos(distance(field, skills_embedding)))
    }
    function semantic_latest_role() {
        expression: max(0, cos(distance(field, latest_role_embedding)))
    }
    function semantic_latest_job_title() {
        expression: max(0, cos(distance(field, latest_job_title_embedding)))
    }
    function semantic_latest_industry() {
        expression: max(0, cos(distance(field, latest_industry_embedding)))
    }
    function keyword_match() {
        expression: bm25(skills) + bm25(latest_role) + bm25(latest_industry) + bm25(latest_job_title)
    }
    first-phase {
        expression: sum(keyword_match + semantic)
    }
    rank-properties {
        fieldMatch(skills).occurrenceImportance: 0.5
        fieldMatch(skills).proximityCompletenessImportance: 0.9
        bm25(skills).k1: 1.5
        bm25(skills).b: 0.85
        fieldMatch(profile_summary).occurrenceImportance: 0.5
        fieldMatch(profile_summary).proximityCompletenessImportance: 0.9
        bm25(profile_summary).k1: 1.5
        bm25(profile_summary).b: 0.85
    }
    summary-features: embedding_sum skill_sum latest_role_sum latest_industry_sum bm25(profile_summary) bm25(skills) bm25(latest_role) bm25(latest_industry) bm25(latest_job_title)
    function embedding_score() {
        expression: attribute(embedding) * query(e)
    }
    function embedding_sum() {
        expression: sum(embedding_score)
    }
    function skill_score() {
        expression: attribute(skills_embedding) * query(e)
    }
    function skill_sum() {
        expression: sum(skill_score)
    }
    function latest_role_score() {
        expression: attribute(latest_role_embedding) * query(e)
    }
    function latest_role_sum() {
        expression: sum(latest_role_score)
    }
    function latest_industry_score() {
        expression: attribute(latest_industry_embedding) * query(e)
    }
    function latest_industry_sum() {
        expression: sum(latest_industry_score)
    }
    match-features {
        bm25(skills)
        bm25(profile_summary)
        nativeRank(profile_summary)
        semantic
        firstPhase
    }
    global-phase {
        expression {
            reciprocal_rank(semantic)
        }
    }
}
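A side note on the first output: its relevance of 0.01639344262295082 is not the first-phase score (firstPhase is 0.8789 there). The global-phase expression in this profile is reciprocal_rank(semantic), which rescores each hit as 1 / (k + rank). Here is a small Python sketch of that formula, assuming Vespa's default k of 60 since the profile does not set k; the function name is only illustrative.

```python
def reciprocal_rank(rank: int, k: float = 60.0) -> float:
    """Score for the hit at 1-based position `rank`: 1 / (k + rank)."""
    if rank < 1:
        raise ValueError("rank is 1-based")
    return 1.0 / (k + rank)


# Assumption: k = 60 (the default when the profile does not pass a k argument).
top_hit = reciprocal_rank(1)     # 1 / 61
second_hit = reciprocal_rank(2)  # 1 / 62
print(top_hit, second_hit)
```

The value for the top hit is 1/61, which lines up with the relevance reported for the first result. With this global-phase the final relevance only reflects a hit's position when ordered by semantic, so the bm25 terms in first-phase do not show up in the final number.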
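I was able to get a bm25(skills) score, and scores for the other matched fields as well.
My findings:
1. bm25 is a pure text ranking feature that operates on indexed string fields. In our case skills is an indexed field, but its type is array. So we either change the values to a comma-separated string or change the field type to string. Reference: bm25
2. After the first step, you have to use the rank() operator in the query.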
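A rough Python sketch of both steps, in case it helps. flatten_skills and build_query_body are hypothetical helper names, not part of any Vespa client, and the recall condition is a shortened version of the one used in this thread. The point of rank() is that only its first operand decides which documents are retrieved; the remaining operands (here userQuery()) only contribute rank features. Which fields userQuery() actually searches depends on the schema's fieldset or default index, which is not shown here.

```python
import json


def flatten_skills(skills: list[str]) -> str:
    """Step 1: join a list of skills into one comma-separated string."""
    return ",".join(s.strip() for s in skills if s.strip())


def build_query_body(recall_yql: str, user_text: str, profile: str = "common") -> dict:
    """Step 2: wrap the recall condition in rank(); userQuery() only adds rank features."""
    # Assumption: user_text contains no double quotes, so it can be embedded as-is.
    return {
        "yql": f"select * from candidate_profile where rank({recall_yql}, userQuery())",
        "query": user_text,
        "input.query(e)": f'embed(e5, "query: {user_text}")',
        "ranking": profile,
    }


flat = flatten_skills(["Java", " Springboot ", "", "Python Programming"])
recall = (
    "(skills matches 'python' OR skills matches 'java') "
    "or ({targetHits: 40}nearestNeighbor(embedding, e))"
)
body = build_query_body(recall, "developer with python, java")
print(flat)
print(json.dumps(body, indent=2))
```

If the field stays an array, the flattened string can be fed as a single-element list instead of a plain string.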
Query example:
"yql" : " select * from candidate_profile WHERE rank((all_role_title matches 'senior') AND (skills matches 'python' OR skills matches 'java') AND (latest_role_title matches 'developer') or ({targetHits: 40}nearestNeighbor(embedding, e)),userQuery())",
"input.query(e)" : 'embed(e5, "query: Candidate who is working as senior, developer has the following skills python, java.")',
"query": " Candidate who is working as senior, developer has the following skills python, java.",
"ranking" : "common"
Search output:
{
"root": {
"id": "toplevel",
"relevance": 1,
"fields": {
"totalCount": 40
},
"coverage": {
"coverage": 100,
"documents": 144,
"full": true,
"nodes": 1,
"results": 1,
"resultsFull": 1
},
"children": [
{
"id": "id:candidate_profile:candidate_profile::85715181-73f9-4f61-9398-4e350e41e989",
"relevance": 22.045853545320217,
"source": "candidate_profile",
"fields": {
"matchfeatures": {
"bm25(profile_summary)": 35.56386309994314,
"bm25(skills)": 11.98552149345962,
"firstPhase": 22.045853545320217,
"semantic": 0.9177947832357726
},
"sddocname": "candidate_profile",
"documentid": "id:candidate_profile:candidate_profile::85715181-73f9-4f61-9398-4e350e41e989",
"first_name": "DEEPAK",
"middle_name": "SINGH",
"city": "Bengaluru",
"gender": "Male",
"skills": [
"Java,Springboot,J2EE,Hibernet,AWS,C/C++,Core Java,Python Programming"
],
"total_months_of_experience": 98,
"candidate_type": "New_candidate",
"languages": [
"English",
"Kannada",
"Hindi",
"Telugu"
],
"has_own_vehicle": false,
"profile_summary": "The candidate has an experience of 8.2 years and is working as Developer, Senior Developer and has the following skills Java,Springboot,J2EE,Hibernet,AWS,C/C++,Core Java,Python Programming in industries like Software.",
"latest_organisation_name": "Flipkart",
"latest_job_title": "Senior Developer",
"latest_role": "Developer",
"latest_industry": "Software",
"latest_employment_type": "Permanent",
"employment_history": [
{
"role": "Developer",
"job_title": "Senior Developer",
"employment_type": "Permanent",
"is_current_job": 1,
"industry": "Software",
"organisation_name": "Flipkart"
}
],
"highest_education_level": "Not mentioned",
"highest_course_is_full_time": false,
"highest_course_is_highest_qualification": false,
"financials": [
{}
],
"summaryfeatures": {
"bm25(candidate_type)": 0,
"bm25(highest_course_name)": 0,
"bm25(highest_education_level)": 0,
"bm25(highest_specialization)": 0,
"bm25(languages)": 0,
"bm25(latest_employment_type)": 0,
"bm25(latest_industry)": 0,
"bm25(latest_job_title)": 4.5712686343124105,
"bm25(latest_role)": 4.5712686343124105,
"bm25(profile_summary)": 35.56386309994314,
"bm25(skills)": 11.98552149345962,
"embedding_sum": 58.45279276280053,
"latest_industry_sum": 48.31929411615329,
"latest_role_sum": 51.93824407275679,
"skill_sum": 52.75094562502136,
"vespa.summaryFeatures.cached": 0
}
}
}
]
}
}
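As a sanity check, the firstPhase value in this output can be reproduced from the reported features using the profile's first-phase expression, keyword_match + semantic. A small Python sketch using only the numbers shown above:

```python
# Feature values copied from the search output above.
keyword_features = {
    "bm25(skills)": 11.98552149345962,
    "bm25(latest_role)": 4.5712686343124105,
    "bm25(latest_industry)": 0.0,
    "bm25(latest_job_title)": 4.5712686343124105,
}
semantic = 0.9177947832357726

# keyword_match is the sum of the four bm25 features in the rank profile.
keyword_match = sum(keyword_features.values())
first_phase = keyword_match + semantic
print(keyword_match, first_phase)
```

The result agrees with the reported firstPhase of about 22.0459. Note that bm25(profile_summary), although it is the largest feature at 35.56, is not part of keyword_match and so does not affect the score, and that semantic contributes less than 1 of the roughly 22 points.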
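bm25(skills) is 0 because the query does not search the skills field: match features are only populated for fields that are searched.
You can search it with a rank term in the query, which does not affect recall.
The rest - how to get great relevance for your use case - is better suited to a meeting.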