
Notes from Handling a Ceph PG unfound Incident

Molet | Ops | 2022-11-09


While checking the Ceph cluster today, I found a PG with an unfound object; hence this post.

1. Check the cluster status

[root@k8snode001 ~]# ceph health detail
HEALTH_ERR 1/973013 objects unfound (0.000%); 17 scrub errors; Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair; Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded
OBJECT_UNFOUND 1/973013 objects unfound (0.000%)
    pg 2.2b has 1 unfound objects
OSD_SCRUB_ERRORS 17 scrub errors
PG_DAMAGED Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair
    pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
    pg 2.44 is active+clean+inconsistent, acting [14,8,21]
    pg 2.73 is active+clean+inconsistent, acting [25,14,8]
    pg 2.80 is active+clean+scrubbing+deep+inconsistent+repair, acting [4,8,14]
    pg 2.83 is active+clean+inconsistent, acting [14,13,6]
    pg 2.ae is active+clean+inconsistent, acting [14,3,2]
    pg 2.c4 is active+clean+inconsistent, acting [8,21,14]
    pg 2.da is active+clean+inconsistent, acting [23,14,15]
    pg 2.fa is active+clean+inconsistent, acting [14,23,25]
PG_DEGRADED Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded
    pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound

The key line in the output is: pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound.
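On a larger cluster with many damaged PGs, picking the unfound ones out of `ceph health detail` by eye gets tedious. A minimal Python sketch that filters the plain-text output (the regex assumes the exact line format shown above):

```python
import re

def unfound_pgs(health_detail: str):
    """Return (pgid, acting_set) pairs for PGs reported in a recovery_unfound state.

    Matches lines like:
        pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
    Note: `ceph health detail` may list the same PG under several sections,
    so callers may want to de-duplicate the result.
    """
    pattern = re.compile(r"pg (\S+) is \S*recovery_unfound\S*, acting \[([\d,]+)\]")
    pairs = []
    for line in health_detail.splitlines():
        m = pattern.search(line)
        if m:
            pairs.append((m.group(1), [int(x) for x in m.group(2).split(",")]))
    return pairs
```

Feeding it the output above would report pg 2.2b with acting set [14, 22, 4], while the merely inconsistent PGs are skipped.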

Now let's look at pg 2.2b and inspect its details.

[root@k8snode001 ~]# ceph pg dump_json pools | grep 2.2b
dumped all
2.2b 2487 1 1 0 1 9533198403 3048 3048 active+recovery_unfound+degraded 2020-07-23 08:56:07.669903 10373'5448370 10373:7312614 [14,22,4] 14 [14,22,4] 14 10371'5437258 2020-07-23 08:56:06.637012 10371'5437258 2020-07-23 08:56:06.637012 0

From the dump we can see this PG currently has 1 degraded and 1 unfound object.
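The long row follows the column order of `ceph pg dump`. As a sketch (the column order is an assumption based on this Ceph release's `PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE …` header), the leading counters can be unpacked like so:

```python
def parse_pg_counters(row: str) -> dict:
    """Map the leading columns of a `ceph pg dump` row to their assumed names.

    Values are kept as strings; convert to int as needed.
    """
    names = ["pg", "objects", "missing_on_primary", "degraded",
             "misplaced", "unfound", "bytes", "log", "disk_log", "state"]
    # zip() stops at the shorter sequence, so trailing columns are ignored
    return dict(zip(names, row.split()))
```

Applied to the row above, this reads off objects=2487, degraded=1, unfound=1, which matches what `ceph health detail` reported.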

2. Check the PG map

[root@k8snode001 ~]# ceph pg map 2.2b
osdmap e10373 pg 2.2b (2.2b) -> up [14,22,4] acting [14,22,4]

The PG map shows that pg 2.2b is mapped to OSDs [14,22,4].

3. Check the pool status

[root@k8snode001 ~]# ceph osd pool stats k8s-1
pool k8s-1 id 2
  1/1955664 objects degraded (0.000%)
  1/651888 objects unfound (0.000%)
  client io 271 KiB/s wr, 0 op/s rd, 52 op/s wr

[root@k8snode001 ~]# ceph osd pool ls detail | grep k8s-1
pool 2 'k8s-1' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 88 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
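One detail worth pulling out of that line: the pool runs with size 3 but min_size 1, meaning the pool keeps accepting writes with only a single surviving replica, which is commonly flagged as a data-loss risk. A small sketch for extracting the replication settings from a `ceph osd pool ls detail` line:

```python
import re

def pool_replication(detail_line: str):
    """Extract (size, min_size) from one `ceph osd pool ls detail` line.

    Returns None if the line is not a replicated pool entry.
    """
    m = re.search(r"replicated size (\d+) min_size (\d+)", detail_line)
    return (int(m.group(1)), int(m.group(2))) if m else None
```

For the k8s-1 line above this returns (3, 1).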

4. Try to repair the lost object in pg 2.2b

[root@k8snode001 ~]# ceph pg repair 2.2b

If the repair never succeeds, query the stuck PG for details, paying particular attention to recovery_state:

[root@k8snode001 ~]# ceph pg 2.2b query
{
    ......
    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2020-07-21 14:17:05.855923",
            "might_have_unfound": [],
            "recovery_progress": {
                "backfill_targets": [],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []
                },
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {
                    "pull_from_peer": [],
                    "pushing": []
                }
            },
            "scrub": {
                "scrubber.epoch_start": "10370",
                "scrubber.active": false,
                "scrubber.state": "INACTIVE",
                "scrubber.start": "MIN",
                "scrubber.end": "MIN",
                "scrubber.max_end": "MIN",
                "scrubber.subset_last_update": "0'0",
                "scrubber.deep": false,
                "scrubber.waiting_on_whom": []
            }
        },
        {
            "name": "Started",
            "enter_time": "2020-07-21 14:17:04.814061"
        }
    ],
    "agent_state": {}
}
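When scripting this check, the interesting part of `ceph pg <pgid> query` is the `recovery_state` array; an empty `might_have_unfound` list, as seen above, suggests no surviving peer is known to still hold the missing object, which is why repair alone cannot bring it back. A minimal sketch for pulling the state names out of the JSON:

```python
import json

def recovery_state_names(query_output: str):
    """Return the recovery_state machine state names from `ceph pg <pgid> query` JSON."""
    doc = json.loads(query_output)
    return [entry["name"] for entry in doc.get("recovery_state", [])]
```

For the (abbreviated) output above this would yield ["Started/Primary/Active", "Started"].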

If repair cannot fix it, there are two options: revert the unfound object to an older version, or delete it outright.

5. Solutions

Revert to an older version:
[root@k8snode001 ~]# ceph pg 2.2b mark_unfound_lost revert
Delete outright:
[root@k8snode001 ~]# ceph pg 2.2b mark_unfound_lost delete

6. Verify

I went with delete here. The cluster then rebuilt the PG; after a short wait, the PG state returned to active+clean:

[root@k8snode001 ~]# ceph pg 2.2b query
{
    "state": "active+clean",
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "epoch": 11069,
    "up": [
        12,
        22,
        4
    ],
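Rather than re-running the query by hand, a small predicate over the query JSON makes it easy to poll from a script until the PG reports clean (a sketch over the output format shown above):

```python
import json

def pg_is_clean(query_output: str) -> bool:
    """True when the PG's state in `ceph pg <pgid> query` JSON is exactly active+clean."""
    return json.loads(query_output)["state"] == "active+clean"
```

A wrapper could call `ceph pg 2.2b query` via subprocess in a loop and sleep until this returns True.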

Check the cluster status again:

[root@k8snode001 ~]# ceph health detail
HEALTH_OK
