我正在为一个大学项目编写一个使用 R 树进行空间数据处理的算法,其中我们的任务是编写一个 KNN 算法作为使用范围估计的范围查询。我们不局限于单一编程语言,所以我选择了 C#。
我已经建立了R树、Point类和KNN算法的基础,但我不知道如何实现有效的范围估计函数,所以我在网上搜索找到现有的实现。我发现了一篇论文(“使用范围估计对远程空间数据库进行有效的 k 最近邻查询”),该论文提出了一种基于密度的解决方案,其中范围是根据 MBB(最小边界框)的密度来估计的整个数据集。
Density = number of points in MBB / area of MBB
这是我的课程的一部分,我在其中实现了论文中编写的功能:
public class KNNRangeQueryRTreeDensity {
... (constructor and pivate fields omitted)
private double GetDensityOfDatasetByGeometry(Geometry geom)
{
int numOfObjs = geom.NumPoints;
Envelope env = geom.EnvelopeInternal;
double area = (env.MaxX - env.MinX) * (env.MaxY - env.MinY);
return numOfObjs / area;
}
public double EstimateRangeWithDensity(int k, Geometry databaseGeometry)
{
return Math.Sqrt(k / (Math.PI * GetDensityOfDatasetByGeometry(databaseGeometry)));
}
public double ExpandRange(int k, Point queryPoint, Envelope prevEnv, int numOfNNRetrieved)
{
double range, envDensity;
if (numOfNNRetrieved == 0)
{
range = 2 * ((prevEnv.MaxX - prevEnv.MinX) / 2);
}
else
{
envDensity = numOfNNRetrieved / ((prevEnv.MaxX - prevEnv.MinX) * (prevEnv.MaxY - prevEnv.MinY));
range = Math.Sqrt(k / (Math.PI * envDensity));
}
return range;
}
public (List<(Point, double)>, double) KNNQuery(int k, Point queryPoint)
{
double dist;
double range = EstimateRangeWithDensity(k, _pointsGeometry); //perform range estimation
Envelope window = new Envelope(queryPoint.X - range, queryPoint.X + range, queryPoint.Y - range, queryPoint.Y + range); //create a new window (envelope) using the query point and range
IList<Point> resultsList = _rTree.Query(window).OrderBy(x => x.DistanceTo(queryPoint)).ToList(); //perform window query
List<(Point, double)> nnList = new List<(Point, double)>(k); //create list with a capacity of k
int count = 0; //this keeps track of how many points are actually added to nnList
foreach (Point point in resultsList) //perform initial query based on the range estimation above
{
dist = point.DistanceTo(queryPoint);
if (dist <= range)
{
nnList.Add((point, dist)); //add a tuple consisting of the point and its distance to the query point to the list
count++;
}
}
/* the above range estimate is not loose and because of that it might fail (i.e. may return less than k)
* when that happens, we need to further refine the range estimate. */
while (count < k)
{
range = ExpandRange(k, queryPoint, window, count); //expand range
window = new Envelope(queryPoint.X - range, queryPoint.X + range, queryPoint.Y - range, queryPoint.Y + range);
//repeat above code block
resultsList = _rTree.Query(window).OrderBy(x => x.DistanceTo(queryPoint)).ToList(); //perform window query
count = 0;
foreach (Point point in resultsList) //perform initial query based on the range estimation above
{
dist = point.DistanceTo(queryPoint);
if (dist <= range)
{
if (!nnList.Contains((point, dist))) //check if point already exists in list
{
nnList.Add((point, dist));
}
count++;
}
}
}
return (nnList, range);
}
}
}
(注意:对于
R-tree
、Geometry
、Envelope
和 Coordinate
对象,我使用的是 .NET 拓扑套件。Point
类是我的。)
Point
班级:
public class Point
{
public double X { get; set; }
public double Y { get; set; }
public Point(double x, double y)
{
X = x;
Y = y;
}
public double DistanceTo(Point o)
{
return Math.Sqrt(Math.Pow(X - o.X, 2) + Math.Pow(Y - o.Y, 2));
}
public Coordinate ToCoordinate()
{
return new Coordinate(X, Y);
}
}
写完后,我继续使用 1024 个点的数据集进行测试,这些点的 XY 坐标随机分布在
[0.0, 10.0)
内,同时设置 k = 12
和查询点 (0.5, 0.6)
:
ID X Y
1|1,629170883274251|5,007505367979177
2|0,7293811350732022|6,2009057338353735
3|4,087100198532967|8,718630060888188
4|6,081353833936785|0,508181154964576
... (up to point 1024)
一切似乎都运行正常,但是它不是返回 12 点,而是返回 13。
控制台输出:
Enter filename with full path: M:\Visual Studio\DPTProj\rangedataset.txt
How many neighbors do you want to retrieve (k = ?)?
Type here: 12
Building R-tree...
R-tree built successfully.
Estimated range for k = 12: 0,8435869948686181
Points within radius 0,8435869948686181 of (0,5, 0,6):
1: (0,6843068314177482, 0,46912474579602703) | Distance to query point: 0,22604720805664638
2: (0,4947431154990304, 0,2580122092077565) | Distance to query point: 0,34202819165328435
3: (0,09012448605621443, 0,4170682841991392) | Distance to query point: 0,4488451287209534
4: (0,8195993866862726, 0,259394385041387) | Distance to query point: 0,46707167855863035
5: (0,28279345495756414, 0,1435370604244699) | Distance to query point: 0,5055067738568947
6: (0,10963706770429252, 0,2626442072273438) | Distance to query point: 0,5159381259683864
7: (0,12078106408975137, 0,17282920897604392) | Distance to query point: 0,5712108945537835
8: (1,0621745516835592, 0,7497310129691526) | Distance to query point: 0,5817728103008762
9: (0,6616472921621275, 1,2313257349800906) | Distance to query point: 0,6516916684379966
10: (1,1860976280579798, 0,6747156943542489) | Distance to query point: 0,6901538887883075
11: (0,06693545266377528, 0,04315583037359446) | Distance to query point: 0,705422094498358
12: (0,7070193349882119, 1,280761478133854) | Distance to query point: 0,7115428273617488
13: (0,19052830067953483, 1,2878140673450258) | Distance to query point: 0,754228694706058
Returned 13 points - range query executed with one or more errors
(只要返回的点数与
k
不同,就会显示上面的“有一个或多个错误”行。)
选择不同的
k
(例如 k = 128
)也会发生这种情况,它会返回 147 分。
当我通过调试器运行它时,似乎附加点被添加到
foreach
循环内的 while
循环中,如果第一个范围查询检索到 小于 k
点,则执行后者。因此,随着范围扩大,更多的点被添加到 resultsList
,这是不需要的:
while (count < k)
{
range = ExpandRange(k, queryPoint, window, count); //expand range
window = new Envelope(queryPoint.X - range, queryPoint.X + range, queryPoint.Y - range, queryPoint.Y + range);
//repeat above code block
resultsList = _rTree.Query(window).OrderBy(x => x.DistanceTo(queryPoint)).ToList(); //perform window query
count = 0;
foreach (Point point in resultsList) //<-- resultsList count is larger than k, so this runs for more times than it should
{
dist = point.DistanceTo(queryPoint);
if (dist <= range)
{
if (!nnList.Contains((point, dist))) //check if point already exists in list
{
nnList.Add((point, dist));
}
count++;
}
}
}
运行一些测试后,根据论文中的公式,
ExpandRange()
函数似乎确实正确扩展了范围,因此范围不是问题。
我认为问题在于,直到检查完
foreach
中的所有点后,resultsList
循环才会停止,这会导致添加多个k
点,因为直到之后才会检查while
循环的条件它完成了。我想出的一个解决方案是,一旦将 k
点添加到 nnList
中,就中断循环,有效地“截断”结果,但我认为事情没那么简单。
有什么办法可以阻止它返回超过
k
点吗?
我想出的一个解决方案是,一旦将 k 个点添加到 nnList 中,就中断循环,有效地“截断”结果,但我认为事情没有那么简单。
是的,您不能这样做,因为您不知道额外的点是否会比列表中的现有点更接近。
限制计数的一个简单方法就是在超过 k 时在 foreach 循环之后对列表进行排序和截断:
foreach (Point point in resultsList) //<-- resultsList count is larger than k, so this runs for more times than it should
{
dist = point.DistanceTo(queryPoint);
if (dist <= range)
{
if (!nnList.Contains((point, dist))) //check if point already exists in list
{
nnList.Add((point, dist));
}
count++;
}
}
// Add this code to take the first k points ordered by distance:
if(count > k)
nnList = nnList.OrderBy(p => p.Item2).Take(k).ToList();