Parsing a large JSON array in Haskell and strictly converting its elements

Question

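I have a JSON file that contains (among other things) a large nested array of multi-precision floating-point numbers. Each float is a quoted numeric string, such as "3.14159265358979323846264338". I want to parse the file and convert each element to a numeric data type as soon as it is read (specifically, the multi-precision floats from an MPFR-based library). The reason is that a multi-precision float takes much less memory than the corresponding numeric string. I am willing to keep a large number of floats in memory, but I do not want to keep all of the file's text in memory.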

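I am aware of JSON streaming libraries such as json-stream. However, I don't know how to force the underlying ByteStrings to be converted to floats during parsing. Naively, it seems I would just build up an array of thunks holding ByteStrings until the file is fully parsed; only when the final value is forced would those thunks be converted to floats and the ByteStrings garbage collected. How can I avoid this?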

Answer 1

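Suppose you were willing to store these as Doubles. You can use json-stream to parse directly into a Vector. If your JSON file had plain numbers instead of quoted strings, it would be simple: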

main1 :: IO ()
main1 = do
    f <- BL.readFile "/tmp/array_noquotes.json"
    let nums = parseLazyByteString (arrayOf real) f :: [Double]
        !dat = V.fromList nums
    pure ()

Since you have quotes, you need a slightly more complicated parser. I'm not sure this is the best approach, but the following works:

main2 :: IO ()
main2 = do
    f <- BL.readFile "/tmp/array_quotes.json"
    let nums = [ read (T.unpack str) 
               | str <- parseLazyByteString (arrayOf string) f]
               :: [Double]
    let !dat = V.fromList nums
    pure ()

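Anyway, run on a 10-million-element array of copies of pi (10 million copies of "3.14159265358979323846264338", for a total file size of 300M), this version runs relatively quickly and reaches a maximum residency of 1G. That is, of course, because it is building a vector of unevaluated parsing thunks. Storing the thunks themselves, plus all the strings from the 300M file and their overhead, easily gets us to 1G.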

However, forcing these parsing thunks is easy. Just write:

main3 :: IO ()
main3 = do
    f <- BL.readFile "/tmp/array_quotes.json"
    let nums = [ num
               | str <- parseLazyByteString (arrayOf string) f
               , let !num = read (T.unpack str) ] -- force the thunks
               :: [Double]
    let !dat = V.fromList nums
    pure ()
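To make the effect of the bang pattern concrete, here is a minimal sketch that uses only base and plain Strings instead of json-stream and Text. The names parseLazy, parseStrict, and lazySpineLength are made up for this illustration; they are not part of the answer's code.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Lazy version: every element is a thunk that still refers to its input string.
parseLazy :: [String] -> [Double]
parseLazy strs = [ read s | s <- strs ]

-- Strict version: the bang pattern evaluates the parsed number before the
-- list cell is produced, so the input string is no longer needed afterwards.
parseStrict :: [String] -> [Double]
parseStrict strs = [ x | s <- strs, let !x = read s ]

-- `length` walks only the spine of the list, so in the lazy version `read`
-- never runs and the unparsable second entry goes unnoticed.
lazySpineLength :: Int
lazySpineLength = length (parseLazy ["1.5", "not a number"])
```

Loaded in GHCi, lazySpineLength evaluates to 2 without complaint, whereas length (parseStrict ["1.5", "not a number"]) fails with a parse error, because producing each list cell already runs read.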

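This takes noticeably longer to run, but maximum residency drops to 200M. That may still seem large, but a boxed vector of 10 million Doubles needs 2 words per Double: one for the D# constructor and one for the underlying Double# value itself. That is 16 bytes per element, so 160M for our example.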

You can do better by using an unboxed vector, which removes all those unnecessary D# constructors:

let !dat = VU.fromList nums  -- VU = Data.Vector.Unboxed

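In this case, you don't have to force the thunks, because they are forced as part of creating the unboxed vector.

Maximum residency is 130M, of which about 80M is for storing the vector.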
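The 160M and 80M figures come from simple word counting, which can be written out as a small calculation. This is a sketch assuming a 64-bit GHC (8-byte words); the function names are hypothetical, and it counts only what the answer counts.

```haskell
-- Assumption: a 64-bit GHC, so one machine word is 8 bytes.
wordBytes :: Int
wordBytes = 8

-- Heap bytes for n evaluated, boxed Doubles, following the answer's
-- accounting: one word for the D# constructor, one for the Double# value.
boxedDoubleBytes :: Int -> Int
boxedDoubleBytes n = n * 2 * wordBytes

-- Payload bytes of an unboxed vector of n Doubles: just the raw values.
unboxedDoubleBytes :: Int -> Int
unboxedDoubleBytes n = n * wordBytes
```

For 10 million elements, boxedDoubleBytes 10000000 is 160000000 and unboxedDoubleBytes 10000000 is 80000000, matching the 160M and 80M quoted above.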

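Note that using a Storable vector (i.e., VS.fromList from Data.Vector.Storable) is slower but has the same space usage. Again, there is no need to force the thunks, because that happens automatically as part of creating the storable vector.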
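The claim that an unboxed container forces its elements on construction can be checked without the vector package. The sketch below swaps in Array and UArray from the array package as stand-ins for boxed and unboxed vectors; it is not what the answer uses, and the names boxedArray, unboxedArray, and throws are hypothetical.

```haskell
import Control.Exception (SomeException, evaluate, try)
import Data.Array.IArray (Array, listArray)
import Data.Array.Unboxed (UArray)

-- Boxed array: elements are stored as pointers and may remain thunks.
boxedArray :: [Double] -> Array Int Double
boxedArray xs = listArray (0, length xs - 1) xs

-- Unboxed array: raw Doubles are written, so each element must be evaluated.
unboxedArray :: [Double] -> UArray Int Double
unboxedArray xs = listArray (0, length xs - 1) xs

-- True if evaluating the argument to weak head normal form raises an exception.
throws :: a -> IO Bool
throws x = either failed (const False) <$> try (evaluate x)
  where
    failed :: SomeException -> Bool
    failed _ = True
```

With these definitions, throws (boxedArray [1, undefined]) yields False, since the bad element stays an unevaluated thunk, while throws (unboxedArray [1, undefined]) yields True, since building the unboxed array has to evaluate every element.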

So, what about Rounded from the package you mentioned?

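Well, it's probably not a great choice if efficient space usage is a concern. Internally, it is a data type with four fields (a precision, a sign, an exponent, and a byte array), so with the fields unpacked it is five words (the constructor plus four fields) plus a byte array (a header, a word for the size, and the payload). So for Rounded TowardZero 128, we are probably talking about 56 bytes of overhead plus 16 bytes of payload, so 72 bytes? Sure enough, if we build a vector of Rounded without forcing: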

main5 :: IO ()
main5 = do
    f <- BL.readFile "/tmp/array_quotes.json"
    let nums = [ read (T.unpack str)
               | str <- parseLazyByteString (arrayOf string) f]
               :: [Rounded TowardZero 128]
    let !dat = V.fromList nums
    pure ()

it finishes quickly and needs about a GB to store the parsing thunks, but if we force the parsing thunks:

main6 :: IO ()
main6 = do
    f <- BL.readFile "/tmp/array_quotes.json"
    let nums = [ num
               | str <- parseLazyByteString (arrayOf string) f
               , let !num = read (T.unpack str) ] -- force the thunks
               :: [Rounded TowardZero 128]
    let !dat = V.fromList nums
    pure ()

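it takes longer to run and reaches a maximum residency of 740M. That is a "savings" over the thunks, but not a big one. In fact, for a number like "42.1" (four bytes in string form), it would do worse.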
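The 72-byte estimate and the measured 740M line up if the word counting is spelled out. This is a sketch of the arithmetic only, assuming 8-byte words and a mantissa rounded up to whole words (the rounding is my assumption); roundedBytes and estimatedTotalBytes are hypothetical names, not part of any library.

```haskell
-- Estimated heap bytes for one Rounded value of the given precision,
-- following the layout described above. Assumes 64-bit (8-byte) words.
roundedBytes :: Int -> Int
roundedBytes precisionBits = overhead + payload
  where
    wordBytes      = 8
    recordWords    = 5  -- constructor + precision, sign, exponent, byte array field
    byteArrayWords = 2  -- byte array header + size word
    overhead       = (recordWords + byteArrayWords) * wordBytes
    -- Assumption: the mantissa occupies a whole number of 64-bit words.
    payload        = ((precisionBits + 63) `div` 64) * wordBytes

-- Estimated bytes for n such values, ignoring the vector itself.
estimatedTotalBytes :: Int -> Int -> Int
estimatedTotalBytes n precisionBits = n * roundedBytes precisionBits
```

Here roundedBytes 128 is 72 and estimatedTotalBytes 10000000 128 is 720000000, which is close to the 740M maximum residency reported above.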

Also, since Rounded has no Unbox or Storable instance, we don't have other options for making this representation more efficient.

Here are my various test cases:

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE BangPatterns #-}

module Main (main) where

import Data.List
import Data.Scientific
import Numeric.Rounded
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import Data.JsonStream.Parser
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as VU
import qualified Data.Vector.Storable as VS

writeExamples :: IO ()
writeExamples = do
  let count = 1000000  -- only a million, for quick testing
  writeFile "/tmp/array_noquotes.json" $ "[" <> intercalate "," (replicate count "3.14159265358979323846264338") <> "]"
  writeFile "/tmp/array_quotes.json" $ "[" <> intercalate "," (replicate count "\"3.14159265358979323846264338\"") <> "]"

-- parse the no quote version, leaving thunks in place
main_noquotes :: IO ()
main_noquotes = do
    f <- BL.readFile "/tmp/array_noquotes.json"
    let nums = parseLazyByteString (arrayOf real) f :: [Double]
        !dat = V.fromList nums
    pure ()

-- parse the quote version, leaving thunks in place
main_quotes :: IO ()
main_quotes = do
    f <- BL.readFile "/tmp/array_quotes.json"
    let nums = [ read (T.unpack str)
               | str <- parseLazyByteString (arrayOf string) f]
               :: [Double]
    let !dat = V.fromList nums
    pure ()

-- parse the quote version, forcing the vector element thunks
mainV_forced :: IO ()
mainV_forced = do
    f <- BL.readFile "/tmp/array_quotes.json"
    let nums = [ num
               | str <- parseLazyByteString (arrayOf string) f
               , let !num = read (T.unpack str) ] -- force the thunks
               :: [Double]
    let !dat = V.fromList nums
    pure ()

-- parse the quote version into an unboxed vector
-- (elements are implicitly forced)
mainVU :: IO ()
mainVU = do
    f <- BL.readFile "/tmp/array_quotes.json"
    let nums = [ read (T.unpack str)
               | str <- parseLazyByteString (arrayOf string) f]
               :: [Double]
    let !dat = VU.fromList nums
    pure ()

-- parse the quote version into a storable vector
-- (elements are implicitly forced)
mainVS :: IO ()
mainVS = do
    f <- BL.readFile "/tmp/array_quotes.json"
    let nums = [ read (T.unpack str)
               | str <- parseLazyByteString (arrayOf string) f ]
               :: [Double]
    let !dat = VS.fromList nums
    pure ()

-- parse to Rounded, leave as thunks
mainRounded_thunks :: IO ()
mainRounded_thunks = do
    f <- BL.readFile "/tmp/array_quotes.json"
    let nums = [ read (T.unpack str)
               | str <- parseLazyByteString (arrayOf string) f]
               :: [Rounded TowardZero 128]
    let !dat = V.fromList nums
    pure ()

-- parse to Rounded, forcing the element thunks
mainRounded_forced :: IO ()
mainRounded_forced = do
    f <- BL.readFile "/tmp/array_quotes.json"
    let nums = [ num
               | str <- parseLazyByteString (arrayOf string) f
               , let !num = read (T.unpack str) ] -- force the thunks
               :: [Rounded TowardZero 128]
    let !dat = V.fromList nums
    pure ()

-- parse to Scientific, forcing the element thunks
mainScientific :: IO ()
mainScientific = do
    f <- BL.readFile "/tmp/array_quotes.json"
    let nums = [ num
               | str <- parseLazyByteString (arrayOf string) f
               , let !num = read (T.unpack str) ] -- force the thunks
               :: [Scientific]
    let !dat = V.fromList nums
    pure ()

main :: IO ()
main = mainRounded_forced
