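I am using XGBoost for my model. I noticed that the h2o cluster does not share memory while the model is being built. RAM utilization on the master A server is very high, while RAM utilization on master B is very low. I checked the h2o logs on both servers and found that the master A log file is continuously updated during model processing, but the master B log file is not. It only shows the log entries from cluster creation.
Sometimes, while the model is being processed, the h2o jar on master A goes down because of high memory usage.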
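I am using h2o version 3.36.1.1 and have created a two-node cluster. The cluster was created successfully, and the cluster details are recorded in the log files.
I checked connectivity between master A and master B by running curl from both sides. Everything works and the cluster is running fine.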
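The curl check can also be scripted so that each node reports how it sees the cluster and how much memory every member has free. The following is a minimal sketch, assuming the `/3/Cloud` REST endpoint and the `nodes`, `ip_port`, `healthy`, `free_mem` and `max_mem` fields of its CloudV3 response; the field names are written from memory and should be verified against the actual response from 3.36.1.1. The helper names `fetch_cloud` and `summarize_cloud` are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_cloud(base_url, timeout=10):
    """GET <base_url>/3/Cloud and return the decoded JSON body."""
    with urlopen(f"{base_url}/3/Cloud", timeout=timeout) as resp:
        return json.load(resp)


def summarize_cloud(cloud):
    """Return one (ip_port, healthy, free_mem_mb, max_mem_mb) row per node.

    Assumes the memory fields are reported in bytes.
    """
    rows = []
    for node in cloud.get("nodes", []):
        rows.append((
            node.get("ip_port"),
            node.get("healthy"),
            round(node.get("free_mem", 0) / 2**20),
            round(node.get("max_mem", 0) / 2**20),
        ))
    return rows


# Usage (placeholder addresses, to be replaced with the real node URLs):
# for url in ("http://master01.user.com:54321", "http://master02.user.com:54321"):
#     for row in summarize_cloud(fetch_cloud(url)):
#         print(url, row)
```

Running this against both nodes while the model trains would show whether both members stay healthy and how free memory develops on each of them.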
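Can anyone help me with these issues?
Why do the two servers not share server resources during model processing?
Why is the master B h2o log not updated?
Why does the h2o jar on master A go down with high memory usage?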
Master A log
main INFO water.default: Open H2O Flow in your web browser: http://xxx.xxx.xxx.xx:54321
main INFO water.default:
FJ-126-15 INFO water.default: Cloud of size 2 formed [master01.user.com/xxx.xxx.xxx.xx:54321, master02.user.com/xxx.xxx.xxx.xx:54321]
058452-166 INFO water.default: GET /3/Metadata/schemas/CloudV3, parms: {}
058452-166 INFO water.default: Locking cloud to new members, because water.api.schemas3.MetadataV3
4058452-14 INFO water.default: GET /3/Metadata/schemas/H2OErrorV3, parms: {}
4058452-15 INFO water.default: GET /3/Metadata/schemas/H2OModelBuilderErrorV3, parms: {}
4058452-18 INFO water.default: POST /4/sessions, parms: {}
4058452-16 INFO water.default: POST /99/Rapids, parms: {ast=(setTimeZone "UTC"), session_id=_sid_a391}
4058452-13 INFO water.default: DELETE /3/DKV, parms: {}
4058452-13 INFO water.default: Removing all objects
4058452-13 INFO water.default: Finished removing objects
4058452-12 INFO water.default: DELETE /3/DKV, parms: {}
4058452-12 INFO water.default: Removing all objects
4058452-12 INFO water.default: Finished removing objects
058452-170 INFO water.default: DELETE /3/DKV, parms: {}
058452-170 INFO water.default: Removing all objects
058452-170 INFO water.default: Finished removing objects
4058452-14 INFO water.default: GET /3/Metadata/schemas/CloudV3, parms: {}
058452-169 INFO water.default: GET /3/Metadata/schemas/H2OErrorV3, parms: {}
058452-166 INFO water.default: GET /3/Metadata/schemas/H2OModelBuilderErrorV3, parms: {}
4058452-19 INFO water.default: POST /4/sessions, parms: {}
4058452-18 INFO water.default: POST /99/Rapids, parms: {ast=(setTimeZone "UTC"), session_id=_sid_bfac}
058452-170 INFO water.default: Reading byte InputStream into Frame:
058452-170 INFO water.default: frameKey: upload_bbcd4f6aeb3c1095e63f66a89cdd4756
058452-170 INFO water.default: totalChunks: 2
058452-170 INFO water.default: totalBytes: 4404663
058452-170 INFO water.default: Success.
058452-167 INFO water.default: POST /3/ParseSetup, parms: {single_quotes=False, source_frames=["upload_bbcd4f6aeb3c1095e63f66a89cdd4756"], check_header=0}
058452-169 INFO water.default: Total file size: 4.2 MB
058452-169 INFO water.default: Parse chunk size 4194304
FJ-1-15 INFO water.default: Parse result for Key_Frame__upload_bbcd4f6aeb3c1095e63f66a89cdd4756.hex (2023 rows, 436 columns):
FJ-1-15 INFO water.default: ColV2 type min max mean sigma NAs constant cardinality
FJ-1-15 INFO water.default: COL1: factor 011022232 YA9854024 1334
FJ-1-15 INFO water.default: COL2: numeric 2019.00 2020.00 2019.70 0.457960
FJ-1-15 INFO water.default: COL3: numeric 1.00000 12.0000 6.07860 2.82287
FJ-1-15 INFO water.default: COL4: factor |00011000813 |09988000074 1334
FJ-1-15 INFO water.default: COL5: factor CUST NAME CUSTOMER 2
FJ-1-15 INFO water.default: COL6: numeric 1.14005e+08 4.10024e+08 2.96146e+08 4.57328e+07
FJ-1-15 INFO water.default: COL7: numeric 10000.0 30000.0 28294.6 5573.93 3
FJ-1-15 INFO water.default: COL8: factor USD 4
FJ-1-15 INFO water.default: COL9: factor 927 RM17 20
FJ-1-15 INFO water.default: COL10: factor NO YES 2
FJ-1-15 INFO water.default: Additional column information only sent to log file...
FJ-1-15 INFO water.default: COL11: numeric -1.00000 175.250 1.07602 5.07740
FJ-1-15 INFO water.default: COL12: numeric -1.00000 97.2262 0.447662 3.19167
FJ-1-15 INFO water.default: COL13: numeric -1.00000 124.206 1.03933 3.94221
FJ-1-15 INFO water.default: response_class: factor 1A to_be_filled 5
FJ-1-15 INFO water.default: response_class_5: factor 1B 1B1 2
FJ-1-15 INFO water.default: response_class_4: factor 1A NON_PERFORME 4
FJ-1-15 INFO water.default: response_class_3: factor 1A NON_PERFORME 4
FJ-1-15 INFO water.default: response_class_2: factor 1A NON_PERFORME 4
FJ-1-15 INFO water.default: response_class_1: factor 1A NON_PERFORME 4
FJ-1-15 INFO water.default: subset: factor test train 2
FJ-1-15 INFO water.default: Chunk compression summary:
FJ-1-15 INFO water.default: Chunk Type Chunk Name Count Count Percentage Size Size Percentage
FJ-1-15 INFO water.default: C0L Constant long 74 8.486 % 5.8 KB 0.207 %
FJ-1-15 INFO water.default: CBS Binary 19 2.179 % 4.4 KB 0.159 %
FJ-1-15 INFO water.default: CXI Sparse Integers 80 9.174 % 25.0 KB 0.897 %
FJ-1-15 INFO water.default: CXF Sparse Reals 50 5.734 % 48.9 KB 1.753 %
FJ-1-15 INFO water.default: C1 1-Byte Integers 7 0.803 % 11.8 KB 0.423 %
FJ-1-15 INFO water.default: C1N 1-Byte Integers (w/o NAs) 92 10.550 % 104.0 KB 3.731 %
FJ-1-15 INFO water.default: C1S 1-Byte Fractions 142 16.284 % 118.4 KB 4.245 %
FJ-1-15 INFO water.default: C2 2-Byte Integers 72 8.257 % 231.7 KB 8.309 %
FJ-1-15 INFO water.default: C2S 2-Byte Fractions 18 2.064 % 22.9 KB 0.822 %
FJ-1-15 INFO water.default: C4 4-Byte Integers 50 5.734 % 109.1 KB 3.913 %
FJ-1-15 INFO water.default: C4S 4-Byte Fractions 127 14.564 % 360.5 KB 12.925 %
FJ-1-15 INFO water.default: C8 8-byte Integers 1 0.115 % 15.0 KB 0.539 %
FJ-1-15 INFO water.default: CUD Unique Reals 5 0.573 % 13.2 KB 0.472 %
FJ-1-15 INFO water.default: C8D 64-bit Reals 135 15.482 % 1.7 MB 61.606 %
FJ-1-15 INFO water.default: Frame distribution summary:
FJ-1-15 INFO water.default: Size Number of Rows Number of Chunks per Column Number of Chunks
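The numbers in the chunk compression summary above can be cross-checked with a little arithmetic. The sketch below uses only values copied from the log (the per-type chunk counts, the 436 columns, and the C8D line with 1.7 MB at 61.606 %); the total size estimate is approximate because the log rounds 1.7 MB.

```python
# Chunk counts copied from the "Chunk compression summary" in the master A log.
chunk_counts = {
    "C0L": 74, "CBS": 19, "CXI": 80, "CXF": 50, "C1": 7, "C1N": 92,
    "C1S": 142, "C2": 72, "C2S": 18, "C4": 50, "C4S": 127, "C8": 1,
    "CUD": 5, "C8D": 135,
}
n_columns = 436  # from "Parse result ... (2023 rows, 436 columns)"

total_chunks = sum(chunk_counts.values())
chunks_per_column = total_chunks / n_columns

# C8D is reported as 1.7 MB and 61.606 % of the frame size,
# so the whole compressed frame is roughly 1.7 / 0.61606 MB.
approx_total_mb = 1.7 / 0.61606

print(total_chunks)               # 872
print(chunks_per_column)          # 2.0
print(round(approx_total_mb, 1))  # 2.8
```

So the parsed frame has 2 chunks per column and occupies only about 2.8 MB in compressed form, which suggests that the size of the frame itself is unlikely to explain the high RAM usage on master A.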
Master B log
main INFO water.default: H2O started in 4906ms
main INFO water.default:
main INFO water.default: Open H2O Flow in your web browser: http://xxx.xxx.xxx.xx:54321
main INFO water.default:
FJ-126-15 INFO water.default: Cloud of size 2 formed [master01.user.com/xxx.xxx.xxx.xx:54321, master02.user.com/xxx.xxx.xxx.xx:54321]
FJ-123-15 INFO water.default: Locking cloud to new members, because Class Id=56
FJ-2-15 INFO water.default: Key upload_bbcd4f6aeb3c1095e63f66a89cdd4756 will be parsed using method DistributedParse.
FJ-2-21 INFO water.default: Key upload_902bcdd31a4aea9f65690f1bc6074886 will be parsed using method DistributedParse.
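To compare the two logs more systematically, the lines can be split into REST requests (GET, POST, DELETE and so on) and everything else. The following is a minimal sketch, assuming the `<thread> <LEVEL> <logger>: <message>` layout seen in the excerpts above (any timestamp prefix already removed); the function name `summarize_log` is hypothetical.

```python
import re
from collections import Counter

# <thread> <LEVEL> <logger>: <message>
LINE_RE = re.compile(
    r"^(?P<thread>\S+)\s+(?P<level>[A-Z]+)\s+(?P<logger>\S+):\s?(?P<msg>.*)$"
)
REST_RE = re.compile(r"^(GET|POST|PUT|DELETE|HEAD) /")


def summarize_log(lines):
    """Count REST request lines versus other lines in an H2O log excerpt."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            counts["unparsed"] += 1
            continue
        kind = "rest_request" if REST_RE.match(m.group("msg")) else "other"
        counts[kind] += 1
    return counts


# A few lines copied from the excerpts above.
master_a = [
    "4058452-18 INFO water.default: POST /4/sessions, parms: {}",
    "4058452-13 INFO water.default: DELETE /3/DKV, parms: {}",
    "4058452-13 INFO water.default: Removing all objects",
]
master_b = [
    "main INFO water.default: H2O started in 4906ms",
    "FJ-123-15 INFO water.default: Locking cloud to new members, because Class Id=56",
    "FJ-2-15 INFO water.default: Key upload_bbcd4f6aeb3c1095e63f66a89cdd4756 will be parsed using method DistributedParse.",
]

print(summarize_log(master_a))
print(summarize_log(master_b))
```

In the excerpts above, the REST request lines appear only in the master A log, while master B does log the DistributedParse of the uploaded keys. One possible reading, which I have not confirmed, is that the client is connected to master A only and that master B still takes part in the parse.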