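I'm working on a data migration. On the old system, users could type their interests into a large text field with no formatting instructions at all. As a result, some people wrote biography-style prose and others wrote a comma-separated list. There are a few other formats, but these are the main ones.
Now, I know how to identify a comma-separated list (CSL). That part is easy. But how do I determine whether a string is a CSL (possibly a short one with only two terms or phrases) or a paragraph someone wrote that just happens to contain commas?
One idea I had was to automatically ignore strings that contain sentence punctuation such as periods, as well as strings that contain no commas. However, I'm worried this won't be enough, or will leave a lot to be desired. So I'd like to ask the community what you think. In the meantime I'll try out my idea.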
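As a rough illustration of that pre-filter idea, here is a minimal sketch. The function name `looks_like_list_candidate` is hypothetical, and treating `.`, `!` and `?` as the "sentence punctuation" to reject is an assumption made for the example, not a rule taken from the old system.

```php
<?php
// Hypothetical first-pass filter, not a full classifier.
// Assumption: "punctuation" means sentence-ending marks (. ! ?).
function looks_like_list_candidate(string $text): bool
{
    $text = trim($text);
    if ($text === '') {
        return false; // Nothing to classify
    }
    // Reject anything that contains sentence-ending punctuation.
    if (preg_match('/[.!?]/', $text) === 1) {
        return false;
    }
    // Reject anything that has no comma at all.
    return strpos($text, ',') !== false;
}

var_dump(looks_like_list_candidate('hiking, biking, chess'));
var_dump(looks_like_list_candidate('I like hiking, and I play chess.'));
var_dump(looks_like_list_candidate('hiking'));
```

A filter like this only narrows the field: a sentence without a closing period and with one comma would still pass, which is exactly the case the question is about.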
Update: OK guys, I have my algorithm. Here it is below...
My code:
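    //Process our interests text field and get the list of interests
    function process_interests($interests)
    {
        $interest_list = array();
        //NOTE: part of this condition, and the lines that set $delimiter, $delimiter_cnt
        //and $word_cnt, were lost from the post. The lines marked "assumed" are a guessed
        //reconstruction (comma as the only delimiter, str_word_count for words), not the original code.
        if (preg_match('/(\.)/', $interests) < 1) //assumed: skip text that contains a period
        {
            $delimiter = ','; //assumed
            $delimiter_cnt = substr_count($interests, $delimiter); //assumed
            $word_cnt = str_word_count($interests); //assumed
            $ratio = 0; //assumed: default so $ratio is always defined
            if ($delimiter_cnt > 0 && $word_cnt > 0)
                $ratio = $delimiter_cnt / $word_cnt;
            //If delimiter is found with the right ratio then we can go forward with this.
            //There should be no more than 5 words per delimiter (ratio = delimiters / words ... this must be at least 0.2)
            if (!empty($delimiter) && $ratio >= 0.2)
            {
                //Check for label with colon after it
                $interests = remove_colon($interests);
                //Now we make our array
                $interests = explode($delimiter, $interests);
                foreach ($interests as $val)
                {
                    $val = humanize($val);
                    if (!empty($val))
                        $interest_list[] = $val;
                }
            }
        }
        return $interest_list;
    }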
    //Cleans up strings a bit
    function humanize($str)
    {
        if (empty($str))
            return ''; //Let's not waste processing power on empty strings
        $str = remove_colon($str); //We do this one more time for inline labels too.
        $str = trim($str); //Remove surrounding whitespace
        $str = ltrim($str, ' -'); //Remove leading dashes
        $str = str_replace('  ', ' ', $str); //Replace double spaces with single spaces
        $str = str_replace(array(".", "(", ")", "\t"), '', $str); //Strip some unwanted junk
        $str = preg_replace('/^and\s+/i', '', $str); //Remove a leading "and" from the term (whole word only, so "Android" is left alone)
        $str = ucwords(preg_replace('/[_]+/', ' ', strtolower(trim($str))));
        return $str;
    }
    //Check for label with colon after it and remove the label
    function remove_colon($str)
    {
        //Check for label with colon after it
        if (strstr($str, ':'))
        {
            $str = explode(':', $str); //If we find it we must remove it
            unset($str[0]); //To remove it we just explode it and take everything to the right of it.
            $str = trim(implode(':', $str)); //Sometimes colons are still used elsewhere, I am going to allow this
        }
        return $str;
    }
Thanks for your help and suggestions!
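In addition to the filtering you mentioned, you could compute the ratio of the number of commas to the length of the string. In a CSL this ratio tends to be high, while in a paragraph it tends to be low. You could set a threshold and select entries based on whether their ratio is high enough. Entries whose ratio falls close to the threshold could be flagged as error-prone and then checked by a moderator.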
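A minimal sketch of that ratio check might look like the following. The function name `classify_by_comma_ratio`, the threshold of 0.05 and the review margin of 0.02 are all placeholder assumptions for illustration; real values would have to be tuned against a sample of the migrated data.

```php
<?php
// Sketch of the comma-to-length ratio idea.
// Assumptions: the threshold (0.05) and margin (0.02) are illustrative
// placeholders, and length is measured in bytes via strlen().
function classify_by_comma_ratio(string $text, float $threshold = 0.05, float $margin = 0.02): string
{
    $text = trim($text);
    $length = strlen($text);
    if ($length === 0) {
        return 'paragraph'; // Nothing to split
    }
    $ratio = substr_count($text, ',') / $length;
    // Too close to the threshold to trust: hand it to a moderator.
    if (abs($ratio - $threshold) < $margin) {
        return 'review';
    }
    return $ratio > $threshold ? 'list' : 'paragraph';
}

echo classify_by_comma_ratio('hiking, biking, chess'), "\n";
echo classify_by_comma_ratio('cooking Italian food, travel'), "\n";
echo classify_by_comma_ratio('I have always enjoyed long walks in the mountains, especially in autumn'), "\n";
```

With these placeholder values the three calls come out as `list`, `review` and `paragraph` respectively. Lists made of long phrases drift toward the paragraph end of the scale, which is why the middle band goes to manual review rather than being decided automatically.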