Array
(
[0] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
[1] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
[2] => The N2225 and N2226 SAS/SATA HBAs support SAS data transfer rates of 3, 6, and 12 Gbps per lane and SATA transfer rates of 3 and 6 Gbps per lane, and they enable maximum connectivity and performance in a low-profile (N2225) or full-height (N2226) form factor.
[3] => Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.
[4] => The following tables list the compatibility information for the N2225 and N2226 SAS/SATA HBAs and System x®, iDataPlex®, and NeXtScale™ servers.
[5] => For more information about the System x servers, including older servers that support the N2225 and N2226 adapters, see the following ServerProven® website:
[6] => The following table lists the external storage systems that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in storage solutions.
[7] => The following table lists the external tape backup units that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in tape backup solutions.
[8] => For more information about the specific versions and service levels that are supported and any other prerequisites, see the ServerProven website:
[9] => The N2225 and N2226 SAS/SATA HBAs carry a one-year limited warranty. When installed in a supported System x server, the adapters assume your system’s base warranty and any Lenovo warranty upgrade.
)
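The elements above are not exact duplicates that array_unique could remove. Instead, some elements are made obsolete by another element that contains exactly the same data plus more, or that sometimes differs by only a few words.
How can I filter these out?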
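First of all, the problem is not that simple, and it is not stated very well: you do not want to remove identical elements, you want to remove similar ones, so your first problem becomes determining which elements are similar.
Since the similarity can occur at any point in the string, requiring the strings to start with the same set of characters is not enough. Take, for example, the following two sentences (adapted from your question):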
Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.
The rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.
They are very similar, but they do not start with the same string. One way to measure similarity is the Smith–Waterman algorithm, for which a PHP implementation exists.
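To make the point concrete, here is a minimal sketch using two shortened, made-up variants of the sentences above. It assumes PHP's built-in similar_text() as the similarity measure (not Smith–Waterman) and only shows that a prefix test fails where a similarity score still comes out high:

```php
<?php
// Shortened, hypothetical variants of the two sentences (assumption for illustration).
$a = 'Rigorous testing ensures reliability.';
$b = 'The rigorous testing ensures reliability.';

// Prefix test: does $b start with $a? It does not.
$isPrefix = (strpos($b, $a) === 0);

// similar_text() writes a percentage into its third argument (passed by reference).
similar_text($a, $b, $percent);

var_dump($isPrefix);           // bool(false)
printf("%.1f%%\n", $percent);  // well above 90
```

So a threshold on the score catches this pair, while any "starts with" rule misses it.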
--- Later edit ---
Here is an implementation that uses PHP's built-in similar_text():
/**
 * @param mixed $array input array
 * @param int $minSimilarity minimum similarity for an item to be removed (percentage)
 * @return array
 */
function applyFilter ($array, $minSimilarity = 90) {
    $result = [];
    foreach ($array as $outerValue) {
        $append = true;
        foreach ($result as $key => $innerValue) {
            $similarity = null;
            similar_text($innerValue, $outerValue, $similarity);
            if ($similarity >= $minSimilarity) {
                if (strlen($outerValue) > strlen($innerValue)) {
                    // always keep the longer one
                    $result[$key] = $outerValue;
                }
                $append = false;
                break;
            }
        }
        if ($append) {
            $result[] = $outerValue;
        }
    }
    return $result;
}
$test = [
'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
'The N2225 and N2226 SAS/SATA HBAs support SAS data transfer rates of 3, 6, and 12 Gbps per lane and SATA transfer rates of 3 and 6 Gbps per lane, and they enable maximum connectivity and performance in a low-profile (N2225) or full-height (N2226) form factor.',
'Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.',
'The following tables list the compatibility information for the N2225 and N2226 SAS/SATA HBAs and System x®, iDataPlex®, and NeXtScale™ servers.',
'For more information about the System x servers, including older servers that support the N2225 and N2226 adapters, see the following ServerProven® website:',
'The following table lists the external storage systems that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in storage solutions.',
'The following table lists the external tape backup units that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in tape backup solutions.',
'For more information about the specific versions and service levels that are supported and any other prerequisites, see the ServerProven website:',
'The N2225 and N2226 SAS/SATA HBAs carry a one-year limited warranty. When installed in a supported System x server, the adapters assume your system’s base warranty and any Lenovo warranty upgrade.',
];
var_dump(applyFilter($test));
--- End of later edit ---
Here is the complete working code for the Smith–Waterman algorithm:
class SmithWatermanGotoh
{
    private $gapValue;
    private $substitution;

    /**
     * Constructs a new Smith Waterman metric.
     *
     * @param float $gapValue a non-positive gap penalty
     * @param object|null $substitution a substitution function
     */
    public function __construct($gapValue = -0.5, $substitution = null)
    {
        if ($gapValue > 0.0) throw new Exception("gapValue must be <= 0");
        // fall back to a default match/mismatch scoring when none is given
        if (empty($substitution)) $this->substitution = new SmithWatermanMatchMismatch(1.0, -2.0);
        else $this->substitution = $substitution;
        $this->gapValue = $gapValue;
    }

    public function compare($a, $b)
    {
        if (empty($a) && empty($b)) {
            return 1.0;
        }
        if (empty($a) || empty($b)) {
            return 0.0;
        }
        // strlen() rather than mb_strlen(): the substitution function indexes bytes,
        // so lengths must be byte lengths as well
        $maxDistance = min(strlen($a), strlen($b))
            * max($this->substitution->max(), $this->gapValue);
        return $this->smithWatermanGotoh($a, $b) / $maxDistance;
    }

    private function smithWatermanGotoh($s, $t)
    {
        $v0 = [];
        $v1 = [];
        $s_len = strlen($s);
        $t_len = strlen($t);
        // first row of the scoring matrix
        $max = $v0[0] = max(0, $this->gapValue, $this->substitution->compare($s, 0, $t, 0));
        for ($j = 1; $j < $t_len; $j++) {
            $v0[$j] = max(0, $v0[$j - 1] + $this->gapValue,
                $this->substitution->compare($s, 0, $t, $j));
            $max = max($max, $v0[$j]);
        }
        // remaining rows; only the previous row ($v0) and the current row ($v1) are kept
        for ($i = 1; $i < $s_len; $i++) {
            $v1[0] = max(0, $v0[0] + $this->gapValue, $this->substitution->compare($s, $i, $t, 0));
            $max = max($max, $v1[0]);
            for ($j = 1; $j < $t_len; $j++) {
                $v1[$j] = max(0, $v0[$j] + $this->gapValue, $v1[$j - 1] + $this->gapValue,
                    $v0[$j - 1] + $this->substitution->compare($s, $i, $t, $j));
                $max = max($max, $v1[$j]);
            }
            for ($j = 0; $j < $t_len; $j++) {
                $v0[$j] = $v1[$j];
            }
        }
        return $max;
    }
}
class SmithWatermanMatchMismatch
{
    private $matchValue;
    private $mismatchValue;

    /**
     * Constructs a new match-mismatch substitution function. When two
     * characters are equal a score of <code>matchValue</code> is assigned. In
     * case of a mismatch a score of <code>mismatchValue</code>. The
     * <code>matchValue</code> must be strictly greater than
     * <code>mismatchValue</code>
     *
     * @param float $matchValue value when characters are equal
     * @param float $mismatchValue value when characters are not equal
     */
    public function __construct($matchValue, $mismatchValue) {
        if ($matchValue <= $mismatchValue) throw new Exception("matchValue must be > mismatchValue");
        $this->matchValue = $matchValue;
        $this->mismatchValue = $mismatchValue;
    }

    // compares the bytes at the given offsets of $a and $b
    public function compare($a, $aIndex, $b, $bIndex) {
        return ($a[$aIndex] === $b[$bIndex] ? $this->matchValue
            : $this->mismatchValue);
    }

    public function max() {
        return $this->matchValue;
    }

    public function min() {
        return $this->mismatchValue;
    }
}
/**
 * @param mixed $array input array
 * @param int $minSimilarity minimum similarity for an item to be removed (percentage)
 * @return array
 */
function applyFilter ($array, $minSimilarity = 90) {
    $swg = new SmithWatermanGotoh();
    $result = [];
    foreach ($array as $outerValue) {
        $append = true;
        foreach ($result as $key => $innerValue) {
            $similarity = $swg->compare($innerValue, $outerValue) * 100;
            if ($similarity >= $minSimilarity) {
                if (strlen($outerValue) > strlen($innerValue)) {
                    // always keep the longer one
                    $result[$key] = $outerValue;
                }
                $append = false;
                break;
            }
        }
        if ($append) {
            $result[] = $outerValue;
        }
    }
    return $result;
}
$test = [
'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
'The N2225 and N2226 SAS/SATA HBAs support SAS data transfer rates of 3, 6, and 12 Gbps per lane and SATA transfer rates of 3 and 6 Gbps per lane, and they enable maximum connectivity and performance in a low-profile (N2225) or full-height (N2226) form factor.',
'Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.',
'The following tables list the compatibility information for the N2225 and N2226 SAS/SATA HBAs and System x®, iDataPlex®, and NeXtScale™ servers.',
'For more information about the System x servers, including older servers that support the N2225 and N2226 adapters, see the following ServerProven® website:',
'The following table lists the external storage systems that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in storage solutions.',
'The following table lists the external tape backup units that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in tape backup solutions.',
'For more information about the specific versions and service levels that are supported and any other prerequisites, see the ServerProven website:',
'The N2225 and N2226 SAS/SATA HBAs carry a one-year limited warranty. When installed in a supported System x server, the adapters assume your system’s base warranty and any Lenovo warranty upgrade.',
];
var_dump(applyFilter($test));
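Now you only need to adjust the $minSimilarity variable to your needs. For example, in your case, keeping the default of 90% removes the first element (which is 99.86% similar to the second one). Setting a lower value (80%), however, also removes the 8th element.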
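To choose a threshold, it can help to look at the scores first. The following is only a sketch: it assumes the built-in similar_text() as the measure, and pairwiseSimilarity is a hypothetical helper name that does not appear in the code above. The three sample strings are made up for illustration.

```php
<?php
/**
 * Hypothetical helper: returns the similar_text() percentage for every pair
 * of elements, keyed as "i-j" with i < j, sorted from most to least similar.
 */
function pairwiseSimilarity(array $items): array
{
    $items = array_values($items); // make sure indexes are 0..n-1
    $scores = [];
    $count = count($items);
    for ($i = 0; $i < $count; $i++) {
        for ($j = $i + 1; $j < $count; $j++) {
            similar_text($items[$i], $items[$j], $percent);
            $scores[$i . '-' . $j] = $percent;
        }
    }
    arsort($scores); // highest percentage first, keys preserved
    return $scores;
}

// Made-up sample data (assumption for illustration).
$scores = pairwiseSimilarity(['tape drives', 'tapes drives', 'warranty']);
foreach ($scores as $pair => $percent) {
    printf("%s: %.2f%%\n", $pair, $percent);
}
```

Pairs that score close to each other near your intended cut-off are the ones worth checking by hand before settling on a value.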
Hope that helps!
Assuming the value always appears at the very beginning, you can do this:
$arr = ["Some Text.", "Some Text. And more details."];
foreach($arr as $key => $value) {
    // Look for the value in every element
    foreach($arr as $key2 => $value2) {
        // Remove element if its value appears at the beginning of another element
        if ($key !== $key2 && strpos($value2, $value) === 0) {
            unset($arr[$key]);
            continue 2;
        }
    }
}
// Re-index array
$arr = array_values($arr);
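This also works if the elements are in reverse order.
You can still use array_filter with a custom callback that uses substr_count to check whether the value occurs more than once in the array: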
$input = array("a","b","c","d","ax","cz");
$str = implode("|",array_unique($input));
$output = array_filter($input, function($var) use ($str){
    return substr_count($str, $var) == 1;
});
print_r($output);
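Applied to sentences like the ones in the question, the same idea could look like the sketch below. dropContained is a hypothetical wrapper name and the sample strings are made up. Two caveats to keep in mind: a value is dropped when it occurs anywhere inside another value, not only at its start, and the "|" separator is assumed not to occur in the data.

```php
<?php
/**
 * Hypothetical wrapper around the snippet above: keeps only the values that
 * occur exactly once in the joined string, i.e. that are not contained in
 * any other value.
 */
function dropContained(array $input): array
{
    $haystack = implode('|', array_unique($input));
    return array_filter($input, function ($value) use ($haystack) {
        return substr_count($haystack, $value) === 1;
    });
}

// Made-up sample data (assumption for illustration).
$sentences = ['Some Text.', 'Some Text. And more details.', 'Other'];
print_r(dropContained($sentences));
```

Because the joined string is built from array_unique($input), exact duplicates each still count once and are all kept, so a final array_unique() on the result may be needed as well.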
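"sometimes just a few words different."
As you said, a text may differ from another by only a few words. In programming, however, you need a precise condition to filter on.
You can set a match percentage above which an element is filtered out.
Here is a basic example to give you the idea.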
<?php
$data = ["this is test", "this is another test", "one test", "two test", "this is two test"];
$percentageMatched = 100; // minimum match percentage at which an element is removed
for ($i = 0; $i < count($data) - 1; $i++) {
    $value = explode(" ", $data[$i]);
    /* check each word against every later text */
    for ($k = $i + 1; $k < count($data); $k++) {
        $nextArray = explode(" ", $data[$k]);
        $foundCount = 0;
        for ($j = 0; $j < count($value); $j++) {
            if (in_array($value[$j], $nextArray)) {
                $foundCount++;
            }
        }
        $fromLine = $i;
        $toLine = $k;
        $percentage = $foundCount / count($value) * 100;
        echo "EN $fromLine matched $percentage % with EN $toLine \n";
        if ($percentage >= $percentageMatched) {
            $data[$i] = ""; // blank it out; array_filter() below drops empty strings
            break;
        }
    }
    echo ".............\n";
}
print_r(array_filter($data));
?>
If the input data is:
Array
(
[0] => this is test
[1] => this is another test
[2] => one test
[3] => two test
[4] => this is two test
)
then with a matched percentage of 100% it gives the output below.
Here indexes 0 and 3 matched 100% and were filtered out:
EN 0 matched 100 % with EN 1
.............
EN 1 matched 25 % with EN 2
EN 1 matched 25 % with EN 3
EN 1 matched 75 % with EN 4
.............
EN 2 matched 50 % with EN 3
EN 2 matched 50 % with EN 4
.............
EN 3 matched 100 % with EN 4
.............
Array
(
[1] => this is another test
[2] => one test
[4] => this is two test
)
Using array_filter is a good option:
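$temp = null; // previously visited element
function prefixmatch($x){
    global $temp;
    $previous = $temp;
    $temp = $x; // remember the current element for the next call
    if ($previous === null) {
        return true; // the first element has nothing to be compared with
    }
    // do an optimistic linear search to determine if there's a prefix match
    // (note: only neighbouring elements are compared)
    $bool = true;
    for($i=0; $i < min([strlen($x), strlen($previous)]); $i++){
        $bool = $bool && ($x[$i] === $previous[$i]);
    }
    // negate the result because array_filter keeps elements whose callback returns true
    return(!$bool);
}
$array1 = ["abc", "abcdef", "xyz"]; // example input
print_r(array_filter($array1, "prefixmatch"));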
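I think stemming and lemmatization would be very helpful in this case. If we consider the first two elements of the array, the only difference is the singular “tape” versus the plural “tapes”.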
Array
(
[0] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
[1] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
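If you tokenize the strings and pass them through a stemmer such as Php Stemmer, both “tape” and “tapes” will be reduced to their stem, “tape”. After stemming, you can compare the array elements. I believe this would remove many of the redundant elements.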
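As a rough illustration of the idea, here is a sketch that uses a toy normalizer in place of a real stemmer. naiveStemSentence is a hypothetical function that only lowercases words and strips one trailing "s"; it is an assumption for illustration and not how Php Stemmer or any real stemming algorithm works.

```php
<?php
/**
 * Toy normalizer standing in for a real stemmer (assumption for illustration):
 * lowercases each word and strips one trailing "s" from words longer than
 * three letters.
 */
function naiveStemSentence(string $sentence): string
{
    $words = preg_split('/\s+/', trim($sentence), -1, PREG_SPLIT_NO_EMPTY);
    $stems = array_map(function ($word) {
        $word = strtolower($word);
        if (strlen($word) > 3 && substr($word, -1) === 's') {
            $word = substr($word, 0, -1);
        }
        return $word;
    }, $words);
    return implode(' ', $stems);
}

// Fragments of the first two array elements.
$a = 'servers and tape drives and RAID storage systems';
$b = 'servers and tapes drives and RAID storage systems';

var_dump(naiveStemSentence($a) === naiveStemSentence($b)); // bool(true)
```

Once both strings normalize to the same text, plain array_unique on the normalized values is enough to spot the duplicate.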
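You can also go one step further and lemmatize the strings. For example, in English the verb “to walk” may appear as “walk”, “walked”, “walks”, “walking”. The base form, “walk”, which one might look up in a dictionary, is called the lemma of the word (from Wikipedia).
Personally I use Stanford NLP for Java. There is also a PHP implementation, PHP-Stanford-NLP.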
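The solution will depend on your definition of “similarity” and on your dataset. One case may differ considerably from another.
One solution that may fit your needs is cosine similarity. Here is a code example: Cosine similarity vs Hamming distance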
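For reference, a minimal sketch of cosine similarity in PHP, computed over word-count vectors. This is not the code from the linked example. The whitespace tokenization, the lowercasing, and the function name cosineSimilarity are assumptions made for illustration.

```php
<?php
/**
 * Sketch of cosine similarity over word-count vectors (bag of words).
 * Returns a value between 0.0 (no shared words) and 1.0 (same word counts).
 */
function cosineSimilarity(string $a, string $b): float
{
    $countsA = array_count_values(preg_split('/\s+/', strtolower($a), -1, PREG_SPLIT_NO_EMPTY));
    $countsB = array_count_values(preg_split('/\s+/', strtolower($b), -1, PREG_SPLIT_NO_EMPTY));

    // Dot product over the words both texts share.
    $dot = 0;
    foreach ($countsA as $word => $count) {
        if (isset($countsB[$word])) {
            $dot += $count * $countsB[$word];
        }
    }

    // Euclidean norms of both count vectors.
    $square = function ($c) { return $c * $c; };
    $normA = sqrt(array_sum(array_map($square, $countsA)));
    $normB = sqrt(array_sum(array_map($square, $countsB)));

    if ($normA == 0 || $normB == 0) {
        return 0.0; // at least one text has no words
    }
    return $dot / ($normA * $normB);
}

printf("%.3f\n", cosineSimilarity('this is test', 'this is another test')); // 0.866
```

Unlike the character-based measures above, this ignores word order entirely, which may or may not be what you want for near-duplicate sentences.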
My answer would be a better fit for cs.stackoverflow.com.
Java code snippet.
Like this:
Set<T> mySet = new HashSet<>(Arrays.asList(someArray));
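In Java 9+, if an unmodifiable set is acceptable (note that Set.of throws an IllegalArgumentException if the array contains duplicate elements):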
Set<T> mySet = Set.of(someArray);
In Java 10+, the generic type parameter can be inferred from the array's component type:
var mySet = Set.of(someArray);
For PHP.
In PHP, you can use the array_unique function to remove duplicates from an array.
Example from php.net:
<?php
$input = array("a" => "green", "red", "b" => "green", "blue", "red");
$result = array_unique($input);
print_r($result);
?>
The output is:
Array
(
[a] => green
[0] => red
[1] => blue
)
Hope this is what you are looking for.