Return links in MongoDB that have no associated document

Question

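I have a set of web pages stored in MongoDB. Each document contains at least a hostname and a pathname value, plus an array of all the links found on the page in the field "a". A document exists only after an attempt has been made to visit the page. For example: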

{
    _id:"deadbeefdeadbeef01234567",
    hostname:"www.example.com",
    pathname:"/",
    a:[{
        href:"/x.css",
        otherattribute:true
    },{
        href:"/",
        otherattribute:false
    }]
},{
    _id:"deadbeefdeadbeef01234568",
    hostname:"www.example.com",
    pathname:"/other",
    a:[{
        href:"/resource",
        otherattribute:true
    },{
        href:"/"
    }]
}

I have an aggregation that, through the magic of MongoDB, returns a list of unique links like this:

{
    _id:"www.example.com",
    items:[
        "/",
        "/x.css",
        "/resource"
    ]
}

I then get that result back in Node.JS and use it like this:

aeach(items,async function(i,iv){
    //! Where i == index and iv == link string

    let result = await db("dbname")
        .collection("webpages")
        .findOne({
            pathname:iv
        })

    if(!result){
        candidate = iv;
        return true;
    }
})

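The aggregation that returns the unique links works well and performs very well, but because I am crawling a large web structure, I potentially need thousands of lookups to find a web page that I have not yet tried to visit.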

aeach is a forEach-like operator that takes an async function callback. Returning true inside that function breaks out of the loop, and returning false continues it.
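For readers who do not have such a helper, the behaviour described above could be sketched roughly as follows. This is not the asker's actual aeach, whose source is not shown; the return value (the index at which the loop stopped, or -1) and the demo function are assumptions made purely for illustration.

```javascript
// Hypothetical sketch of an aeach-style helper; the real aeach may differ.
// Calls callback(index, value) for each element in order and awaits each call.
// If the callback resolves to true the loop stops (like break);
// any other value moves on to the next element (like continue).
async function aeach(items, callback) {
  for (let i = 0; i < items.length; i++) {
    const stop = await callback(i, items[i]);
    if (stop === true) {
      return i; // assumption: report the index at which the loop stopped
    }
  }
  return -1; // assumption: -1 means the loop ran to completion
}

// Illustrative usage: stop as soon as "/x.css" is seen.
async function demo() {
  const visited = [];
  const stoppedAt = await aeach(["/", "/x.css", "/resource"], async (i, iv) => {
    visited.push(iv);
    return iv === "/x.css";
  });
  return { visited, stoppedAt };
}
```

Because each callback is awaited before the next one starts, the lookups run one at a time, which is why thousands of findOne calls become slow.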

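So the question is:

How can I return only the list of strings that do not correspond to the pathname field of any document (that is, pages I have not yet tried to visit)?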

Edit:

My current aggregation is as follows:

[{
    $match: {
        hostname: scope.connection.hostname
    }
},{
    $project: {
        a: "$a"
    }
},{
    $unwind: {
        path: "$a"
    }
},{                             // Remove some junk elements
    $match:{
        "a.href": {
            $not: /(#|link)/gi
        }
    }
},{
    $group: {
        _id: scope.connection.hostname,
        items: {
            $addToSet: "$a"
        }
    }
}]

Answer 1

Itamar made an excellent observation. I can carry pathname through the aggregation and operate on it later, so I built the following aggregate for anyone who may need it in the future.

[{
    $match:{
        hostname:scope.connection.hostname
    }
},{
    $project:{
        a: "$a",
        pathname: "$pathname"
    }
},{
    $unwind:{
        path: "$a"
    }
},{
    $match:{
        "a.href": {
            $not: /(#|link|search|http)/gi
        }
    }
},{
    $group:{
        _id: "hostname",
        items: {
            $addToSet: "$a.href"
        },
        paths: {
            $addToSet: "$pathname"
        }
    }
},{
    $project:{
        todo: {
            $setDifference: ["$items", "$paths"]
        }
    }
},{
    $unwind:{
        path: "$todo"
    }
},{
    $project:{
        _id: "$todo"
    }
},{
    $limit:200
}]

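This returns the collected link hrefs (the items set) that do not appear among the values of any document's pathname field, capped at 200 results.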
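To make the $setDifference step concrete, here is a small plain-JavaScript sketch of the same idea applied to the two sample documents from the question. The setDifference function is a hypothetical stand-in written for illustration, not MongoDB's implementation, and MongoDB does not guarantee the order of the resulting array.

```javascript
// Plain-JavaScript illustration of what the $setDifference stage computes.
// Returns the elements of `items` that are not in `paths`, without duplicates.
function setDifference(items, paths) {
  const known = new Set(paths);
  return [...new Set(items)].filter((item) => !known.has(item));
}

// Values derived from the two sample documents in the question.
const items = ["/x.css", "/", "/resource"]; // every a.href gathered by $addToSet
const paths = ["/", "/other"];              // every pathname gathered by $addToSet

// Links that have no document yet, i.e. pages not yet visited.
const todo = setDifference(items, paths);
console.log(todo); // [ '/x.css', '/resource' ]
```

The link "/" is dropped because a document with pathname "/" already exists, while "/other" never appears in the output because it is a visited page, not an outstanding link.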
