自定义inputformat用于读取hadoop中的json

问题描述 投票:2回答:1

我是hadoop的初学者,我被告知创建一个自定义inputformat类来读取json数据,我已经google了,并学会了如何创建一个自定义inputformat类来从文件中读取数据。但我坚持解析json数据。我的json数据看起来像这样

[
    {
        "_count": 30,
        "_start": 0,
        "_total": 180,
        "values": [
            {
                "attachment": {
                    "contentDomain": "techcarnival2013.eventbrite.com",
                    "contentUrl": "http://techcarnival2013.eventbrite.com/",
                    "imageUrl": "http://ebmedia.eventbrite.com/s3-s3/static/images/django/logos/eb_home_tm-trans-fb.png",
                    "summary": "Get to know a few thousand of Silicon Valley's best and brightest while enjoying unparalleled access to Candlestick Park,\u00a0games, food, music and more. We'll have carnival games you haven't played since you were ten, giant inflatable obstacle...",
                    "title": "Tech Carnival @ Candlestick Park"
                },
                "comments": {
                    "_total": 0
                },
                "creationTimestamp": 1373908436000,
                "creator": {
                    "firstName": "Clayton",
                    "headline": "Director of Operations",
             "secondname":{
                "name":"myname"
                },
                    "lastName": "K.",
                    "pictureUrl": "http://m.c.lnkd.licdn.com/mpr/mprx/0_R7Vm6_RqBDHaHCDzJHRA6hsNcwOfECjzMeaA6heqHeo0v6ovBWoCe8pVJiYrd5pJVu4KdbnQQ3Lj"
                },
                "likes": {
                    "_total": 0
                },
                "relationToViewer": {
                    "availableActions": {
                        "_total": 7,
                        "values": [
                            {
                                "code": "add-comment"
                            },
                            {
                                "code": "categorize-as-job"
                            },
                            {
                                "code": "categorize-as-promotion"
                            },
                            {
                                "code": "flag-as-inappropriate"
                            },
                            {
                                "code": "follow"
                            },
                            {
                                "code": "like"
                            },
                            {
                                "code": "reply-privately"
                            }
                        ]
                    },
                    "isFollowing": false,
                    "isLiked": false
                },
                "summary": "Network with 4,000+ from the tech community, including folks from DFJ, Google, LinkedIn, Square, Uber, Y Combinator, 500 Startups, etc. $10 ticket gets you all-you-can-ride access to the pop-up Tech Carnival, will be the biggest Wednesday night of the tech summer.",
                "title": "Tech Event @ Candlestick Park on Wednesday, July 17th! Come play carnival games with ~4,000 of the Bay area's best and brightest!"
            },
            {
                "attachment": {
                    "contentDomain": "lifebeyondnumbers.com",
                    "contentUrl": "http://bit.ly/10VTqMu",
                    "imageUrl": "http://lifebeyondnumbers.com/wp-content/uploads/2013/07/lurnq_Online_Courses.jpg",
                    "summary": "LurnQ offers a platform for learning and teaching that is free for everyone. It caters to a diverse online audience and is relevant to everyone in general. The key segment that we address now is of life long learners.",
                    "title": "LurnQ - making lifelong learning clutter free, fun and a social..."
                },
                "comments": {
                    "_total": 0
                },
                "creationTimestamp": 1373883177000,
                "creator": {
                    "firstName": "Syed",
                    "headline": "Founder and CEO at QubiqSquare",
                    "lastName": "Muksit",
                    "pictureUrl": "http://m.c.lnkd.licdn.com/mpr/mprx/0_Y5gdzlRCbQBTqIa-pXYnz-01b6KinDO-pFWnz-ZCZLk1WWdt-_SLUt2uWmrpzo0OxQxcVv6pRjbE"
                },
                "likes": {
                    "_total": 0
                },
                "relationToViewer": {
                    "availableActions": {
                        "_total": 7,
                        "values": [
                            {
                                "code": "add-comment"
                            },
                            {
                                "code": "categorize-as-job"
                            },
                            {
                                "code": "categorize-as-promotion"
                            },
                            {
                                "code": "flag-as-inappropriate"
                            },
                            {
                                "code": "follow"
                            },
                            {
                                "code": "like"
                            },
                            {
                                "code": "reply-privately"
                            }
                        ]
                    },
                    "isFollowing": false,
                    "isLiked": false
                },
                "summary": "LurnQ offers a platform for learning and teaching that is free for everyone. It caters to a diverse online audience and is relevant to everyone in general. The key segment that we address now is of life long learners.",
                "title": "There is so much to learn and most of the times, we don\u2019t even know that this-and-that good stuff exists.  http://bit.ly/10VTqMu"
            },
            {
                "attachment": {
                    "contentDomain": "techcarnival2013.eventbrite.com",
                    "contentUrl": "http://techcarnival2013.eventbrite.com/",
                    "imageUrl": "http://ebmedia.eventbrite.com/s3-s3/static/images/django/logos/eb_home_tm-trans-fb.png",
                    "summary": "Get to know a few thousand of Silicon Valley's best and brightest while enjoying unparalleled access to Candlestick Park,\u00a0games, food, music and more. We'll have carnival games you haven't played since you were ten, giant inflatable obstacle...",
                    "title": "Tech Carnival @ Candlestick Park"
                },
                "comments": {
                    "_total": 0
                },
                "creationTimestamp": 1373654758000,
                "creator": {
                    "firstName": "Clayton",
                    "headline": "Director of Operations",
                    "lastName": "K.",
                    "pictureUrl": "http://m.c.lnkd.licdn.com/mpr/mprx/0_R7Vm6_RqBDHaHCDzJHRA6hsNcwOfECjzMeaA6heqHeo0v6ovBWoCe8pVJiYrd5pJVu4KdbnQQ3Lj"
                },
                "likes": {
                    "_total": 0
                },
                "relationToViewer": {
                    "availableActions": {
                        "_total": 7,
                        "values": [
                            {
                                "code": "add-comment"
                            },
                            {
                                "code": "categorize-as-job"
                            },
                            {
                                "code": "categorize-as-promotion"
                            },
                            {
                                "code": "flag-as-inappropriate"
                            },
                            {
                                "code": "follow"
                            },
                            {
                                "code": "like"
                            },
                            {
                                "code": "reply-privately"
                            }
                        ]
                    },
                    "isFollowing": false,
                    "isLiked": false
                },
                "summary": "Network with 4,000+ from the tech community, including folks from DFJ, Google, LinkedIn, Square, Uber, Y Combinator, 500 Startups, etc. $10 ticket gets you all-you-can-ride access to the pop-up Tech Carnival, will be the biggest Wednesday night of the tech summer.",
                "title": "Tech Event @ Candlestick Park on Wednesday, July 17th! Come play carnival games with ~4,000 of the Bay area's best and brightest!"
            }
..........
........ so on

]

所以我很困惑如何在我的自定义inputformat类中读取json对象。关于如何解析这个的想法?我想读取json数组中的单个json对象,我的意思是读取正确的json字符串然后将字符串赋予map我将在地图中使用json解析器来构建我自己的键值对。有关这方面的帮助吗?提前感谢

json hadoop mapreduce bigdata
1个回答
1
投票

如果您的问题符合Magham Ravi的评论,那么答案就是好的。

但是,如果您有一个包含上述所有JSON数据的文件,您可能需要读取整个文件并从map函数中的值部分(BytesWritable值)将其作为String检索并将其提供给JSON解析器可在同一map()函数中使用。

请看看WholeFileInputFormat

此外,如果您在单个文件中说多个JSON对象数据以及将每个JSON对象数据作为映射器中的值的内容,您可以使用类似于定义了开始和结束标记的XMLInputFormat。对于JSON,您必须具有唯一的开始和结束标记,这些标记准确标记所需的单个JSON数据对象的开头和结尾。仅仅使用start-tag =“[{”和end-tag =“}]”可能没有用,如果你想将上面的整个JSON对象作为一个值返回,因为你已经有许多嵌套会混淆InputFormat。

如果在任何情况下都无法实现上述功能,请尝试构建customTextInputFormat,覆盖TextInputFormat中定义的LineReader

在LineReader类中,你会很好这两个集合(我可能有点过时,请检查现在是否可配置使用配置属性,我知道CDH已经使它可配置,如果不是你需要覆盖)

private static final byte CR = '\r';
private static final byte LF = '\n';

你可以放开CR并改变LF指向“] \ n [”,因为你的每个独立的JSON数据都是如图所示的形式,或者你会更好地了解它们如何?

[

...... JSON 1

]

[

...... JSON 2

]

[

...... JSON N.

]

(注意:之间有一个\ n]和[标记为不同JSON对象数据之间的边界。

希望这是有道理的。

© www.soinside.com 2019 - 2024. All rights reserved.