具有范围估计的 KNN 算法几乎总是返回超过 K 点

问题描述 投票:0回答:1

我正在为一个大学项目编写一个使用 R 树进行空间数据处理的算法,其中我们的任务是编写一个 KNN 算法作为使用范围估计的范围查询。我们不局限于单一编程语言,所以我选择了 C#。

我已经建立了R树、Point类和KNN算法的基础,但我不知道如何实现有效的范围估计函数,所以我在网上搜索找到现有的实现。我发现了一篇论文(“使用范围估计对远程空间数据库进行有效的 k 最近邻查询”),该论文提出了一种基于密度的解决方案,其中范围是根据 MBB(最小边界框)的密度来估计的整个数据集。

Density = number of points in MBB / area of MBB

这是我的课程的一部分,我在其中实现了论文中编写的功能:

public class KNNRangeQueryRTreeDensity {

    ... (constructor and pivate fields omitted)
    
    private double GetDensityOfDatasetByGeometry(Geometry geom)
    {
        int numOfObjs = geom.NumPoints;
        Envelope env = geom.EnvelopeInternal;
        double area = (env.MaxX - env.MinX) * (env.MaxY - env.MinY);
        return numOfObjs / area;
    }
    
    public double EstimateRangeWithDensity(int k, Geometry databaseGeometry)
    {
        return Math.Sqrt(k / (Math.PI * GetDensityOfDatasetByGeometry(databaseGeometry)));
    }
    
    public double ExpandRange(int k, Point queryPoint, Envelope prevEnv, int numOfNNRetrieved)
    {
        double range, envDensity;
    
        if (numOfNNRetrieved == 0)
        {
            range = 2 * ((prevEnv.MaxX - prevEnv.MinX) / 2);
        }
        else
        {
            envDensity = numOfNNRetrieved / ((prevEnv.MaxX - prevEnv.MinX) * (prevEnv.MaxY - prevEnv.MinY));
            range = Math.Sqrt(k / (Math.PI * envDensity));
        }
    
        return range;
    }
    
    public (List<(Point, double)>, double) KNNQuery(int k, Point queryPoint)
    {
        double dist;
        double range = EstimateRangeWithDensity(k, _pointsGeometry); //perform range estimation
        Envelope window = new Envelope(queryPoint.X - range, queryPoint.X + range, queryPoint.Y - range, queryPoint.Y + range); //create a new window (envelope) using the query point and range
    
        IList<Point> resultsList = _rTree.Query(window).OrderBy(x => x.DistanceTo(queryPoint)).ToList(); //perform window query
        List<(Point, double)> nnList = new List<(Point, double)>(k); //create list with a capacity of k
        int count = 0; //this keeps track of how many points are actually added to nnList
        foreach (Point point in resultsList) //perform initial query based on the range estimation above
        {
            dist = point.DistanceTo(queryPoint);
    
            if (dist <= range)
            {
                nnList.Add((point, dist)); //add a tuple consisting of the point and its distance to the query point to the list
                count++;
            }
        }
    
        /* the above range estimate is not loose and because of that it might fail (i.e. may return less than k)
         * when that happens, we need to further refine the range estimate. */
        while (count < k)
        {
            range = ExpandRange(k, queryPoint, window, count); //expand range
            window = new Envelope(queryPoint.X - range, queryPoint.X + range, queryPoint.Y - range, queryPoint.Y + range);
    
            //repeat above code block
            resultsList = _rTree.Query(window).OrderBy(x => x.DistanceTo(queryPoint)).ToList(); //perform window query
            count = 0;
            foreach (Point point in resultsList) //perform initial query based on the range estimation above
            {
                dist = point.DistanceTo(queryPoint);
    
                if (dist <= range)
                {
                    if (!nnList.Contains((point, dist))) //check if point already exists in list
                    {
                        nnList.Add((point, dist));
                    }
    
                    count++;
                }
            }
        }
    
        return (nnList, range);
        }
    }
}

(注意:对于

R-tree
Geometry
Envelope
Coordinate
对象,我使用的是 .NET 拓扑套件
Point
类是我的。)

Point
班级:

public class Point
{
    public double X { get; set; }
    public double Y { get; set; }

    public Point(double x, double y)
    {
        X = x;
        Y = y;
    }

    public double DistanceTo(Point o)
    {
        return Math.Sqrt(Math.Pow(X - o.X, 2) + Math.Pow(Y - o.Y, 2));
    }

    public Coordinate ToCoordinate()
    {
        return new Coordinate(X, Y);
    }
}

写完后,我继续使用 1024 个点的数据集进行测试,这些点的 XY 坐标随机分布在

[0.0, 10.0)
内,同时设置
k = 12
和查询点
(0.5, 0.6)

ID        X                 Y

1|1,629170883274251|5,007505367979177
2|0,7293811350732022|6,2009057338353735
3|4,087100198532967|8,718630060888188
4|6,081353833936785|0,508181154964576
... (up to point 1024)

一切似乎都运行正常,但是它不是返回 12 点,而是返回 13

控制台输出:

Enter filename with full path: M:\Visual Studio\DPTProj\rangedataset.txt
How many neighbors do you want to retrieve (k = ?)?
Type here: 12

Building R-tree...
R-tree built successfully.

Estimated range for k = 12: 0,8435869948686181

Points within radius 0,8435869948686181 of (0,5, 0,6):
1: (0,6843068314177482, 0,46912474579602703) | Distance to query point: 0,22604720805664638
2: (0,4947431154990304, 0,2580122092077565) | Distance to query point: 0,34202819165328435
3: (0,09012448605621443, 0,4170682841991392) | Distance to query point: 0,4488451287209534
4: (0,8195993866862726, 0,259394385041387) | Distance to query point: 0,46707167855863035
5: (0,28279345495756414, 0,1435370604244699) | Distance to query point: 0,5055067738568947
6: (0,10963706770429252, 0,2626442072273438) | Distance to query point: 0,5159381259683864
7: (0,12078106408975137, 0,17282920897604392) | Distance to query point: 0,5712108945537835
8: (1,0621745516835592, 0,7497310129691526) | Distance to query point: 0,5817728103008762
9: (0,6616472921621275, 1,2313257349800906) | Distance to query point: 0,6516916684379966
10: (1,1860976280579798, 0,6747156943542489) | Distance to query point: 0,6901538887883075
11: (0,06693545266377528, 0,04315583037359446) | Distance to query point: 0,705422094498358
12: (0,7070193349882119, 1,280761478133854) | Distance to query point: 0,7115428273617488
13: (0,19052830067953483, 1,2878140673450258) | Distance to query point: 0,754228694706058

Returned 13 points - range query executed with one or more errors

(只要返回的点数与

k
不同,就会显示上面的“有一个或多个错误”行。)

选择不同的

k
(例如
k = 128
)也会发生这种情况,它会返回 147 分

当我通过调试器运行它时,似乎附加点被添加到

foreach
循环内的
while
循环中,如果第一个范围查询检索到 小于
k
,则执行后者。因此,随着范围扩大,更多的点被添加到
resultsList
,这是不需要的:

while (count < k)
        {
            range = ExpandRange(k, queryPoint, window, count); //expand range
            window = new Envelope(queryPoint.X - range, queryPoint.X + range, queryPoint.Y - range, queryPoint.Y + range);
    
            //repeat above code block
            resultsList = _rTree.Query(window).OrderBy(x => x.DistanceTo(queryPoint)).ToList(); //perform window query
            count = 0;
            foreach (Point point in resultsList) //<-- resultsList count is larger than k, so this runs for more times than it should
            {
                dist = point.DistanceTo(queryPoint);
    
                if (dist <= range)
                {
                    if (!nnList.Contains((point, dist))) //check if point already exists in list
                    {
                        nnList.Add((point, dist));
                    }
    
                    count++;
                }
            }
        }

运行一些测试后,根据论文中的公式,

ExpandRange()
函数似乎确实正确扩展了范围,因此范围不是问题。

ExpandRange() returned value formula (image)

我认为问题在于,直到检查完

foreach
中的所有点后,
resultsList
循环才会停止,这会导致添加多个
k
点,因为直到之后才会检查
while
循环的条件它完成了。我想出的一个解决方案是,一旦将
k
点添加到
nnList
中,就中断循环,有效地“截断”结果,但我认为事情没那么简单。

有什么办法可以阻止它返回超过

k
点吗?

c# algorithm knn range-query
1个回答
0
投票

我想出的一个解决方案是,一旦将 k 个点添加到 nnList 中,就中断循环,有效地“截断”结果,但我认为事情没有那么简单。

是的,您不能这样做,因为您不知道额外的点是否会比列表中的现有点更接近。

限制计数的一个简单方法就是在超过 k 时在 foreach 循环之后对列表进行排序和截断:

        foreach (Point point in resultsList) //<-- resultsList count is larger than k, so this runs for more times than it should
        {
            dist = point.DistanceTo(queryPoint);

            if (dist <= range)
            {
                if (!nnList.Contains((point, dist))) //check if point already exists in list
                {
                    nnList.Add((point, dist));
                }

                count++;
            }
        }

        // Add this code to take the first k points ordered by distance:
        if(count > k)
           nnList = nnList.OrderBy(p => p.Item2).Take(k).ToList();
© www.soinside.com 2019 - 2024. All rights reserved.