为什么对于大数据集，“大小”会出现几个数字的错误？

Question

我正在编写一个线性探测程序并使用输入文件对其进行测试。我将我的输出与我提供的输出文件进行比较。对于小输入文件，我的输出是正确的。对于较大的输入文件（包含 1000 多个命令），我的大小方法在数组变得非常大后返回错误的数字。我的“尺寸”不正确最多 2 个（例如，如果正确尺寸是 248，我的程序将返回 250）。

这是我的代码（我是初学者程序员）：

import java.io.File;
import java.io.FileNotFoundException;
import java.util.*;

class LinearProbing{

    protected int size = 0;
    protected Scanner fileScanner = null;
    protected int[] array;
    protected boolean isRehashing = false;

    public LinearProbing(){
        array = new int[5];
        emptySpots();
    }

    public int hashValue(int x){
        return (x % array.length + array.length) % array.length;
    }

    public double loadFactor(){
        return (double) size / array.length;
    }

    public void emptySpots(){
        for (int i = 0; i < array.length; i++){
            array[i] = Integer.MIN_VALUE;
        }
    }

    public boolean contains(int x){
        int hashValue = hashValue(x);
        if (array[hashValue] == Integer.MIN_VALUE){ // Empty spot
            return false;
        }
        else{ // Spot is not empty -- we explore
            if (array[hashValue] == x){ // x is at the index it hashes to
                return true;
            }
            // x must be at at an index after hashValue, or it does not exist and we return false
            int index = hashValue;
            while (array[index] != Integer.MIN_VALUE){
                index = (index + 1) % array.length;
                if (array[index] == x){
                    return true;
                }
            }
        }
        return false;
    }

    public void rehash(){
        isRehashing = true;
        int[] oldArray = array; 
        array = new int[oldArray.length * 2]; 
        emptySpots();
        for (int i = 0; i < oldArray.length; i++){
            if (oldArray[i] != Integer.MIN_VALUE){
                insert(oldArray[i]);
            }
        }
        isRehashing = false;
    }

    public void insert (int x){
        if (contains(x)){ // We do not allow duplicates
            return;
        }
        int hashValue = hashValue(x);
        if (array[hashValue] == Integer.MIN_VALUE){ // The spot that x hashes to is empty
            array[hashValue] = x;
        }
        else{
            int index = hashValue;
            while (array[index] != Integer.MIN_VALUE){
                index = (index + 1) % array.length;
                if (array[index] == Integer.MIN_VALUE){
                    array[index] = x;
                    break;
                }
            }
        }
        if (!(isRehashing)){
            size++; // Since rehashing calls upon insert but doesn't affect the size, we do not increase size when rehashing
            if (loadFactor() > 0.7){ // If the array is more than 70% full, we rehash
                rehash();
            }
        }
    }

    public void remove(int x){
        int index = hashValue(x);
        if (array[index] == Integer.MIN_VALUE){ // Element could not have been inserted if array[hashValue] is empty, so we return
            return;
        }
        if (contains(x)){ // The set contains the element -- we continue
            if (array[index] == x){ // x is at the index it hashes to
                array[index] = Integer.MIN_VALUE;
            }
            else{ // x is at an index after hashValue
                while (array[index] != Integer.MIN_VALUE){
                    index = (index + 1) % array.length;
                    if (array[index] == x){
                        array[index] = Integer.MIN_VALUE;
                        break;
                    }
                }
            }
            size--;
            fillHole(index);
        }
    }

    public void fillHole(int index){ // Recursive
        int nextIndex = (index + 1) % array.length;
        if (array[nextIndex] == Integer.MIN_VALUE){ // The next spot is empty, so we return
            return;
        }
        if (hashValue(array[nextIndex]) == index){
            array[index] = array[nextIndex]; // We fill the hole with the element at nextIndex
            array[nextIndex] = Integer.MIN_VALUE;
            fillHole(nextIndex); // Now we have to fill the new hole we created
        }
    }

    public int size(){
        return size;
    }

    public void readFile(String fileName){
        try {
            fileScanner = new Scanner(new File(fileName));
        } 
        catch (FileNotFoundException e) {
            System.out.println("Could not read from file " + fileName);
            System.exit(-1);
        }
        while (fileScanner.hasNextLine()){
            String[] parts = fileScanner.nextLine().split(" ");
            if (parts[0].contains("insert")){
                insert(Integer.parseInt(parts[1]));
            }
            if (parts[0].contains("remove")){
                remove(Integer.parseInt(parts[1]));
            }
            if (parts[0].contains("contains")){
                System.out.println(contains(Integer.parseInt(parts[1])));
            }
            if (parts[0].contains("size")){
                System.out.println(size());
            }
        }
        fileScanner.close();
    }

    public static void main(String[] args){
        LinearProbing linearProbing = new LinearProbing();
        if (args.length > 0){
            linearProbing.readFile(args[0]);
        }
    }


}

如果您想亲自查看，这里是输入文件： https://github.uio.no/IN2010/effektive-mengder-ressursside/tree/main/inputs

这是输出文件： https://github.uio.no/IN2010/effektive-mengder-ressursside/tree/main/outputs

我尝试为重新哈希定制一个新的插入方法，以便重新哈希调用这个新的插入方法。这种新方法不会增加大小或检查重复项，因为它与重新哈希无关。它没有改变输出，所以我删除了它。

我已经纠正了循环中的错误，这确实产生了影响（我的程序过去常常返回与正确答案相差太远的大小）。由于我在程序上花了很多时间，所以我不记得我所做的所有更改。

Answer 1

我认为

remove()

的实现不正确：如果找到值，则清除该槽，但忽略后续槽可能包含散列到相同或前面槽的值这一事实。这使得这些值在后续调用中无法访问

这里有一个例子可以帮助您理解这个问题。想象一下，您有一个包含 3 个槽的哈希表，最初全部为空 (

[-1, -1, -1]

)，然后您插入两个值，a 和 b，它们都恰好哈希到槽 0。然后插入后，您的哈希表看起来像这个：

[a, b, -1]

。现在如果你删除例如元素 a 位于槽 0 中，哈希表变为

[-1, b, -1]

。由于槽 0 为空，因此无法再找到值 b。

contains(b)

将返回 false（错误），并且

insert(b)

会将其重新插入到槽 0（错误），创建一个包含两次相同值的哈希表

[b, b, -1]

，这解释了为什么大小是错误的。

此问题的标准解决方案是用墓碑值（例如-2）重写空槽，这不会终止

contains()

或

insert()

中的搜索，或者通过以下方式消除线性探测序列中的间隙：向后移动后续元素（例如，删除 a 时，

[a, b, -1]

变为

[b, -1, -1]

）。

对于第一个解决方案，您必须确保墓碑仍计入负载因子。第二种解决方案是可能的，但很难正确解决，因为线性探测序列可能包含散列到不同槽的元素，并且它们甚至不需要按顺序出现！因此，如果这是一个玩具问题，您可能想坚持使用墓碑，因为它们更容易正确实现。

为什么对于大数据集，“大小”会出现几个数字的错误？

问题描述投票：0回答：1

1个回答

最新问题

为什么对于大数据集，“大小”会出现几个数字的错误？

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1