I am building a web crawler that stores its crawl results in a MySQL table.
The table has five main columns: URL, TITLE, DESCRIPTION, KEYWORDS, BODY.
At the moment I am using MySQL's FULLTEXT search like this:
SELECT URL, title, description,
       MATCH (description, keywords, title, URL) AGAINST ('$keyword' IN BOOLEAN MODE) AS score
FROM record
WHERE MATCH (description, keywords, title, URL) AGAINST ('$keyword' IN BOOLEAN MODE)
ORDER BY score DESC;
But this does not give me good results. Consider the image below: for the search "Facebook", Facebook itself is ranked only 23rd.
Can I prioritize the search by column? For example, I would like the query to give the highest priority to URL, then description, then title, then keywords, and finally body.
Any suggestions?
SELECT URL, title, description,
       MATCH (description, keywords, title, URL) AGAINST ('$keyword' IN BOOLEAN MODE) AS score
FROM record
WHERE URL LIKE '%$keyword%'
   OR MATCH (description, keywords, title, URL) AGAINST ('$keyword' IN BOOLEAN MODE)
ORDER BY score DESC;
Just use the LIKE operator to match the URL; see the code above. Thank you!
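Since the query above interpolates `$keyword` straight into the SQL text, a keyword containing a quote, `%` or `_` can break the query or change what LIKE matches. Here is a minimal sketch of the same query built with placeholders, written in Python rather than the PHP the thread appears to use. The helper names `escape_like` and `build_url_like_query` are made up for illustration, the `%s` placeholder style is the one used by common Python MySQL drivers, and backslash is assumed to be the LIKE escape character (MySQL's default unless the server runs with NO_BACKSLASH_ESCAPES).

```python
def escape_like(term: str, escape_char: str = "\\") -> str:
    """Escape LIKE wildcards so the term is matched literally.

    The escape character itself is handled first so that the backslashes
    added for % and _ are not escaped a second time.
    """
    for ch in (escape_char, "%", "_"):
        term = term.replace(ch, escape_char + ch)
    return term


def build_url_like_query(keyword: str):
    """Return (sql, params) for the LIKE + FULLTEXT query from the thread.

    Hypothetical helper: the table and column names are copied from the
    question, everything else is an assumption of this sketch.
    """
    sql = (
        "SELECT URL, title, description, "
        "MATCH (description, keywords, title, URL) "
        "AGAINST (%s IN BOOLEAN MODE) AS score "
        "FROM record "
        "WHERE URL LIKE %s "
        "OR MATCH (description, keywords, title, URL) "
        "AGAINST (%s IN BOOLEAN MODE) "
        "ORDER BY score DESC"
    )
    # One parameter per placeholder, in the order they appear in the SQL.
    params = (keyword, "%" + escape_like(keyword) + "%", keyword)
    return sql, params
```

The pair would then be passed to a driver call such as `cursor.execute(sql, params)`. Note that a row matched only through `URL LIKE` may still get a low `score`, so it can sort below the FULLTEXT matches.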
Take a look at something like SoundEx:
See: http://www.madirish.net/?article=85
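To give an idea of what Soundex does, here is a small Python sketch of the classic four-character American Soundex code, which maps similar-sounding spellings to the same key. The function name `soundex` is chosen here for illustration. This is not a reimplementation of MySQL's built-in `SOUNDEX()`, which returns strings of arbitrary length and can produce different codes for some words.

```python
# Consonant groups of classic American Soundex; vowels, Y, H and W have no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex key of `word` ("" if it has no letters)."""
    letters = [c for c in word.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W are skipped and do not separate equal codes.
            continue
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        # A vowel resets `prev`, so a repeated code after it is kept.
        prev = code
    # Pad with zeros and cut to the conventional length of four.
    return (result + "000")[:4]


print(soundex("Facebook"), soundex("Facebok"))  # both F212
```

A misspelled search such as "Facebok" gets the same key as "Facebook", which is what makes this kind of key useful as an extra, fuzzy matching column.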
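Alternatively, have you considered doing the weighting yourself? (I don't have MySQL available locally, so apologies for the semi-pseudocode.)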
SELECT
    URL,
    title,
    description,
    -- Per-column relevance; each single-column MATCH generally needs
    -- its own FULLTEXT index on that column.
    MATCH (URL) AGAINST ('$keyword' IN BOOLEAN MODE) AS urlscore,
    MATCH (description) AGAINST ('$keyword' IN BOOLEAN MODE) AS descscore,
    MATCH (title) AGAINST ('$keyword' IN BOOLEAN MODE) AS titlescore,
    MATCH (body) AGAINST ('$keyword' IN BOOLEAN MODE) AS bodyscore,
    -- Weighted total: URL counts 4x, description 3x, title 2x, body 1x.
    (MATCH (URL) AGAINST ('$keyword' IN BOOLEAN MODE) * 4)
      + (MATCH (description) AGAINST ('$keyword' IN BOOLEAN MODE) * 3)
      + (MATCH (title) AGAINST ('$keyword' IN BOOLEAN MODE) * 2)
      + (MATCH (body) AGAINST ('$keyword' IN BOOLEAN MODE) * 1) AS weightedscore
FROM record
WHERE MATCH (description, keywords, title, URL) AGAINST ('$keyword' IN BOOLEAN MODE)
-- MySQL accepts a SELECT alias in ORDER BY, so the sum is not repeated here.
ORDER BY weightedscore DESC;
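The weighted query above repeats the same MATCH expression many times, so it may be easier to generate it from a table of weights. The following Python sketch does that under a few assumptions: the function name `build_weighted_query`, the `ALLOWED_COLUMNS` whitelist and the `%s` placeholder style are illustrative choices, not part of the original answer, and the table and column names are taken from the question.

```python
# Column names are inserted into the SQL text, so only known names are accepted.
ALLOWED_COLUMNS = {"URL", "title", "description", "keywords", "body"}

MATCH_TEMPLATE = "MATCH ({cols}) AGAINST (%s IN BOOLEAN MODE)"


def build_weighted_query(keyword, weights, where_columns):
    """Return (sql, params) for a weighted FULLTEXT ranking query.

    weights:       dict mapping column name -> integer weight
    where_columns: columns of the combined MATCH used to filter rows
    """
    for col in list(weights) + list(where_columns):
        if col not in ALLOWED_COLUMNS:
            raise ValueError(f"unexpected column name: {col!r}")

    # One "(MATCH (col) AGAINST (...) * weight)" term per weighted column.
    terms = [
        "(" + MATCH_TEMPLATE.format(cols=col) + f" * {int(weight)})"
        for col, weight in weights.items()
    ]
    where_match = MATCH_TEMPLATE.format(cols=", ".join(where_columns))

    sql = (
        "SELECT URL, title, description, "
        + " + ".join(terms)
        + " AS weightedscore FROM record WHERE "
        + where_match
        + " ORDER BY weightedscore DESC"
    )
    # The keyword is bound once per placeholder: each term plus the WHERE clause.
    params = (keyword,) * (len(terms) + 1)
    return sql, params


sql, params = build_weighted_query(
    "facebook",
    {"URL": 4, "description": 3, "title": 2, "body": 1},
    ["description", "keywords", "title", "URL"],
)
print(sql)
```

Changing the priorities then only means editing the weights dictionary, for example adding a weight for `keywords`, which the query above filters on but does not score.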