您的位置:首页 > 数据库

Db2性能问题:临时表空间太大,导致连不上数据库

2017-09-30 12:43 441 查看

问题现象:

某Db2系统经常出现连不上数据库的情况(连库一直HANG住),过一段时间自己就好了。期间观察到有latch现象。于是收集了db2pd -latch和db2pd -stack all的数据。

数据分析:

1. 首先查看latch wait现象,有109个应用在等SQLO_LT_sqeLocalDatabase__dblatch,这个latch是连库所必须的, 就是因为等这个latch,所以连不上库:

qingsong@db2a:~$ grep -i SQLO_LT_sqeLocalDatabase__dblatch latch.out | wc -l

109

等这个latch的应用的stack显示在尝试连库:

<StackTrace>

-------Frame------ ------Function + Offset------

0x09000000078484FC sqloXlatchConflict + 0x70

0x09000000078486C0 sqloXlatchConflict@glue1AA + 0x78

0x0900000007A56F48 StartUsingLocalDatabase__8sqeDBMgrFP8SQLE_BWAP8sqeAgentRccP8sqlo_gmt + 0xC6C

0x0900000007A53510 AppStartUsing__14sqeApplicationFP8SQLE_BWAP8sqeAgentcT3P5sqlcaPc + 0x17C

0x0900000007A8B460 AppLocalStart__14sqeApplicationFP14db2UCinterface + 0x49C

0x0900000007994828 sqlelostWrp__FP14db2UCinterface + 0x3C

0x09000000079948D8 sqleUCengnInit__FP14db2UCinterfaceUs + 0x60

0x0900000007A94D80 sqleUCagentConnect + 0x2A8

0x0900000007A96294 sqljsConnectAttach__FP13sqljsDrdaAsCbP14db2UCinterface + 0x104

0x0900000007A958E4 sqljs_ddm_accsec__FP14db2UCinterfaceP13sqljDDMObject + 0x340

0x0900000007A9550C sqljsParseConnect__FP13sqljsDrdaAsCbP13sqljDDMObjectP14db2UCinterface + 0x50

0x0900000007A95450 .sqljsParse.fdpr.clone.0__FP13sqljsDrdaAsCbP14db2UCinterfaceP8sqeAgentb + 0x1C

0x0900000007940BA4 @63@sqljsSqlam__FP14db2UCinterfaceP8sqeAgentb + 0x2D4

0x0900000007ACF5DC @63@sqljsDriveRequests__FP8sqeAgentP14db2UCconHandle + 0xB4

0x0900000007ACF290 @63@sqljsDrdaAsInnerDriver__FP18SQLCC_INITSTRUCT_Tb + 0x2D0

0x0900000007ACED1C sqljsDrdaAsDriver__FP18SQLCC_INITSTRUCT_T + 0x100

0x0900000007796D1C RunEDU__8sqeAgentFv + 0x2F4

0x0900000007793418 EDUDriver__9sqzEDUObjFv + 0xF4

0x090000000778A1CC sqloEDUEntry + 0x264

</StackTrace>

<LatchInformation>

Waiting on latch type: (SQLO_LT_sqeLocalDatabase__dblatch) - Address: (7800000002eb208), Line: 1013, File: sqle_database_services.C

</LatchInformation>

2. 再看hold这个latch的,都是同一个应用,200876,看看这个应用的stack:

xxxxxxxx.200876.000.stack.txt

<StackTrace>

-------Frame------ ------Function + Offset------

0x0900000000007758 stat64 + 0x18

0x0900000007983800 sqlofindn2 + 0x84

0x09000000082745AC sqlofindn2@glue1F2 + 0x7C

0x0900000007B8C20C next__13SQLB_FIND_BUFFPP16sqlo_filefindbuf + 0x24

0x0900000009283660 sqlbMonSMSGetContainerPages__FP12SQLB_POOL_CBiPUi + 0x214

0x0900000009564370 sqlbSnapshotTablespace__FRC16sqlm_bp_int_infoP16sqeLocalDatabaseCUiT3P15sqlm_tablespaceP19sqm_snapshot_bufferP16sqlm_header_infob + 0xB0

0x09000000092ABE98 @92@write_sqlm_ts__FPC16sqm_agent_entityP16sqeLocalDatabaseCUiT3P19sqm_snapshot_bufferP16sqlm_header_info + 0x200

0x09000000092AB838 @92@snap_tspaces_for_db__FP16sqm_agent_entityP19sqm_snapshot_bufferP16sqeLocalDatabaseP14sqlm_collectedPCcCUiPP22sqlm_tablespace_headerP16sqlm_header_info + 0x420

0x0900000009512DB0 sqlmonssagnt__FUi13sqm_entity_idP6sqlmaiT1PvP14sqlm_collectedUsT7P5sqlca + 0xA8

0x090000000841D444 sqlmonssbackend__FP12SQLE_DB2RA_T + 0x22C

0x0900000008D9F2BC sqlesrvr__FP14db2UCinterface + 0x678

0x0900000007A4C6CC sqleMappingFnServer__FP5sqldaP5sqlca + 0x1D0

0x0900000007A4BFD4 sqlerKnownProcedure__FiPcPiP5sqldaT4P13sqlerFmpTableP8sqeAgentP5sqlca + 0x344

0x0900000007A25694 sqlerCallDL__FP14db2UCinterfaceP9UCstpInfo + 0x580

0x0900000007A25C64 .sqljs_ddm_excsqlstt.fdpr.clone.104__FP14db2UCinterfaceP13sqljDDMObject + 0xC0

0x090000000794213C sqljsParseRdbAccessed__FP13sqljsDrdaAsCbP13sqljDDMObjectP14db2UCinterface + 0x114

0x0900000007941EE4 .sqljsParse.fdpr.clone.0__FP13sqljsDrdaAsCbP14db2UCinterfaceP8sqeAgentb + 0x6DC

0x0900000007940BA4 @63@sqljsSqlam__FP14db2UCinterfaceP8sqeAgentb + 0x2D4

0x0900000007ACF5DC @63@sqljsDriveRequests__FP8sqeAgentP14db2UCconHandle + 0xB4

0x0900000007ACF290 @63@sqljsDrdaAsInnerDriver__FP18SQLCC_INITSTRUCT_Tb + 0x2D0

0x0900000007ACED1C sqljsDrdaAsDriver__FP18SQLCC_INITSTRUCT_T + 0x100

0x0900000007796D1C RunEDU__8sqeAgentFv + 0x2F4

0x0900000007793418 EDUDriver__9sqzEDUObjFv + 0xF4

0x090000000778A1CC sqloEDUEntry + 0x264

</StackTrace>

可以通过stack看到,它是在做表空间的snapshot,进一步来讲,MonSMSGetContainerPages是在 MONITOR SMS 表空间时获取容器(CONTAINER)的数据页(PAGE)。在Db2的设计中,snapshot是要获取dblatch的。所以,根据以上信息可以推断出连不上库的原因是临时表空间页数过多,导致snapshot获取它的容器信息时间过长,没有释放连库所需要的latch。

3. 再看一下其他应用的stack信息,搜索Temp关键字,发现有75个应用都在写临时表空间。可以由stack看到,是卡在了pwrite64,也就是写临时数据。

qingsong@db2a:~$ grep -i "sqldInsertTemp" -l *.stack.txt | wc -l

75

以xxxxxxxx.150760.000.stack.txt为例

<StackTrace>

-------Frame------ ------Function + Offset------

0x090000000002A678 pwrite64 + 0x38

0x09000000078AFC7C .sqloseekwrite64.fdpr.clone.29 + 0x134

0x09000000078AFABC sqloWriteBlocks + 0xD4

0x09000000078AF960 sqlbWriteBlocks__FP16SqlbOpenFileInfoPvlUlUiPUlP12SQLB_GLOBALS + 0x50

0x09000000078AE83C sqlbWritePageToDisk__FP11SQLO_IO_REQP16SqlbOpenFileInfoP12SQLB_GLOBALS + 0x190

0x09000000078AE60C sqlbWritePage + 0x100

0x09000000078AE45C sqlbWritePage@glue4FC + 0x78

0x09000000078D0728 .@71@sqlbCheckVictimSlot.fdpr.clone.1873__FP12SQLB_GLOBALSP15SQLB_BufferPoolP19SQLB_PAGE_VICTIM_CBP11SQLB_FIX_CBP8SQLB_BPDP13SQLB_BPD_BASEPP8SQLB_BPD + 0x1C

0x09000000078CEDF4 .sqlbFreeUpSlot.fdpr.clone.231__FP11SQLB_FIX_CBPUiPP8SQLB_BPD + 0xC68

0x09000000078CF1A0 sqlbGetVictimSlot__FP11SQLB_FIX_CBiPUiPP8SQLB_BPDUl + 0x250

0x09000000078D3970 .sqlbGetPageFromDisk.fdpr.clone.101__FP11SQLB_FIX_CBi + 0xD0

0x09000000078D637C sqlbfix__FP11SQLB_FIX_CB + 0xCC0

0x09000000078D6A68 @147@497002@sqlbFixPage__FP11SQLB_FIX_CBiUiT2 + 0x54

0x09000000078B5664 sqlbFixNextNewPage__FP11SQLB_FIX_CBbUiT3@glueDB2 + 0xE0

0x0900000007972D58 sqldAppendTempRow__FP8sqeAgentPciT3P8SQLZ_RIDP8SQLD_TCB + 0x2E0

0x0900000007972820 @101@sqldInsertTemp__FP8sqeAgentP8SQLD_TCBUlT3PP10SQLD_VALUEP8SQLZ_RID + 0x1B0

0x09000000078868C4 sqldRowInsert__FP8sqeAgentUsT2UcUliPP10SQLD_VALUEP13SQLD_TDATARECP8SQLZ_RID + 0x30

0x090000000A3016AC sqlri_hsjnFlushSinglePartition__FP8sqlrr_cbP11sqlri_hsjnoUiP20sqlri_hsjnTupleBlock + 0x97C

0x090000000A300188 @44@sqlri_hsjnFlushBlocks__FP8sqlrr_cbP11sqlri_hsjnoP18sqlri_shared_hsjnoPP20sqlri_hsjnTupleBlockliT6 + 0x5E8

0x09000000099F5DB8 sqlri_hsjnPartitionFull__FP8sqlrr_cbiP25sqlri_hsjnPartitionBucketP11sqlri_hsjnoP18sqlri_shared_hsjno + 0x43C

0x090000000788F824 sqlri_hsjnNewTuple__FP8sqlrr_cbP11sqlri_hsjno + 0x24

0x090000000788FF64 sqlrihsjnpd__FP8sqlrr_cb + 0x84

0x09000000078FD2EC sqldEvalDataPred__FP13SQLD_DFM_WORKPUlP10SQLD_DPRED + 0x240

0x09000000078FA970 sqldReadNorm__FP13SQLD_DFM_WORKl + 0x540

0x09000000078A0248 @101@sqldFetchNext__FP13SQLD_DFM_WORK + 0x190

0x09000000078A0938 sqldFetchNext__FP13SQLD_DFM_WORK@glueD8F + 0x74

0x09000000078A0B64 sqldfrd__FP13SQLD_DFM_WORK + 0x170

0x09000000078EB5C0 sqldfrd__FP13SQLD_DFM_WORK@glue5A9 + 0x74

0x09000000078EB400 sqldRowFetch__FP8sqeAgentP8SQLD_CCBUlT3PP10SQLD_VALUEP8SQLZ_RIDT3P12SQLD_ID_LISTP9SQLP_LSN8 + 0xF0

0x09000000078ECDEC sqlritaSimplePerm__FP8sqlrr_cb + 0x184

0x09000000078F3624 sqlriSectInvoke__FP8sqlrr_cbP12sqlri_opparm + 0x30

0x09000000080A8098 sqlrr_dss_router__FP8sqlrr_cb + 0x6F4

0x090000000937FAA8 sqlrr_subagent_router__FP8sqeAgentP12SQLE_DB2RA_T + 0x5CC

0x09000000086F0FE4 sqleSubRequestRouter__FP8sqeAgentPUiT2 + 0x3FC

0x09000000075E4F60 sqleProcessSubRequest__FP8sqeAgent + 0x2D5C

0x0900000007D05BEC RunEDU__8sqeAgentFv + 0xEC

0x0900000007793418 EDUDriver__9sqzEDUObjFv + 0xF4

0x090000000778A1CC sqloEDUEntry + 0x264

</StackTrace>

<LatchInformation>

Holding Latch type: (SQLO_LT_SQLB_POOL_CB__writeLatch) - Address: (7000000490fb838), Line: 2476, File: sqlbpacc.C HoldCount: 1

Holding Latch type: (SQLO_LT_SQLD_APP_CB__inDMSlatch) - Address: (700000027860098), Line: 7052, File: /view/db2_v97fp8_aix64_s130316/vbs/engn/include/sqldinlines.h HoldCount: 1

</LatchInformation>

结论

综上:所以问题应该是由临时表空间过大->运行监控函数时间长,一直持有latch->新的应用不能连接。在HANG期间,一般建议至少要收集db2pd -eve和db2pd -stack all的数据,由于没有db2pd -eve的数据,看不到出问题时候的临时表空间使用量。不过,在后续的监控中,果然发现出问题时临时表空间使用量在疯狂增长。找到应用对应的SQL,并对其进行调优之后,问题消失。

另外,有一个类似的文章,是原IBM Db2开发人员郝庆运写的,也是由于临时表空间比较大,导致应用性能问题,如下:
https://www.ibm.com/developerworks/cn/data/library/ba-cn-db2-tempspace-oversize-performance/index.html
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签:  db2 性能