While checking the Ceph cluster today, I found a PG with lost (unfound) objects, which led to this post~~~
1. Check the cluster status
[root@k8snode001 ~]# ceph health detail
HEALTH_ERR 1/973013 objects unfound (0.000%); 17 scrub errors; Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair; Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded
OBJECT_UNFOUND 1/973013 objects unfound (0.000%)
    pg 2.2b has 1 unfound objects
OSD_SCRUB_ERRORS 17 scrub errors
PG_DAMAGED Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair
    pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
    pg 2.44 is active+clean+inconsistent, acting [14,8,21]
    pg 2.73 is active+clean+inconsistent, acting [25,14,8]
    pg 2.80 is active+clean+scrubbing+deep+inconsistent+repair, acting [4,8,14]
    pg 2.83 is active+clean+inconsistent, acting [14,13,6]
    pg 2.ae is active+clean+inconsistent, acting [14,3,2]
    pg 2.c4 is active+clean+inconsistent, acting [8,21,14]
    pg 2.da is active+clean+inconsistent, acting [23,14,15]
    pg 2.fa is active+clean+inconsistent, acting [14,23,25]
PG_DEGRADED Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded
    pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
From the output, we find: pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound.
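As a side note, the output also reports 8 inconsistent PGs. The rados tool can show exactly which objects failed scrub in those PGs. A minimal sketch, assuming the rados CLI is available on this node (these commands are not from the original session):

# List the PGs in pool k8s-1 that have scrub inconsistencies
rados list-inconsistent-pg k8s-1
# Show per-object error details for one of the inconsistent PGs reported above
rados list-inconsistent-obj 2.44 --format=json-pretty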
Now let's query pg 2.2b and look at its detailed information.
[root@k8snode001 ~]# ceph pg dump_json pools | grep 2.2b
dumped all
2.2b  2487  1  1  0  1  9533198403  3048  3048  active+recovery_unfound+degraded  2020-07-23 08:56:07.669903  10373'5448370  10373:7312614  [14,22,4]  14  [14,22,4]  14  10371'5437258  2020-07-23 08:56:06.637012  10371'5437258  2020-07-23 08:56:06.637012  0
From the MISSING_ON_PRIMARY / DEGRADED / UNFOUND columns (each showing 1), we can see that one object in this PG is missing.
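To see exactly which object is unfound, Ceph provides a list_unfound subcommand. A minimal sketch (the command is standard, but it was not run in the original session):

# List the unfound object(s) in this PG, along with probe status
ceph pg 2.2b list_unfound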
2. Check the PG map
[root@k8snode001 ~]# ceph pg map 2.2b
osdmap e10373 pg 2.2b (2.2b) -> up [14,22,4] acting [14,22,4]
The PG map shows that pg 2.2b is mapped to OSDs [14,22,4].
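Before going further, it is worth confirming that the three acting OSDs are actually up, since an unfound object can sometimes be recovered simply by bringing a down OSD back. A sketch, not from the original session:

# Confirm the status and host of each OSD in the acting set
ceph osd tree | grep -E 'osd\.(14|22|4)\b'
# Or locate a single OSD
ceph osd find 14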
3. Check the pool status
[root@k8snode001 ~]# ceph osd pool stats k8s-1
pool k8s-1 id 2
  1/1955664 objects degraded (0.000%)
  1/651888 objects unfound (0.000%)
  client io 271 KiB/s wr, 0 op/s rd, 52 op/s wr

[root@k8snode001 ~]# ceph osd pool ls detail | grep k8s-1
pool 2 'k8s-1' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 88 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
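Note the pool settings in the last line: replicated size 3 with min_size 1, meaning the pool keeps accepting I/O as long as a single replica survives. The same values can be read individually; a sketch:

# Read individual replication settings for the pool
ceph osd pool get k8s-1 size
ceph osd pool get k8s-1 min_size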
4. Try to recover the lost object in pg 2.2b
[root@k8snode001 ~]# ceph pg repair 2.2b
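Keep in mind that repair is asynchronous: the command only instructs the primary OSD to scrub and repair, so the PG state will not change immediately. You can also trigger a deep scrub to refresh the PG's inconsistency information; a sketch, not from the original session:

# Re-scrub the PG to refresh its inconsistency state
ceph pg deep-scrub 2.2b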
If the repair never succeeds, you can inspect the details of the stuck PG, paying particular attention to recovery_state. The command is as follows:
[root@k8snode001 ~]# ceph pg 2.2b query
{
    ......
    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2020-07-21 14:17:05.855923",
            "might_have_unfound": [],
            "recovery_progress": {
                "backfill_targets": [],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []
                },
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {
                    "pull_from_peer": [],
                    "pushing": []
                }
            },
            "scrub": {
                "scrubber.epoch_start": "10370",
                "scrubber.active": false,
                "scrubber.state": "INACTIVE",
                "scrubber.start": "MIN",
                "scrubber.end": "MIN",
                "scrubber.max_end": "MIN",
                "scrubber.subset_last_update": "0'0",
                "scrubber.deep": false,
                "scrubber.waiting_on_whom": []
            }
        },
        {
            "name": "Started",
            "enter_time": "2020-07-21 14:17:04.814061"
        }
    ],
    "agent_state": {}
}
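Here might_have_unfound is empty, which suggests no peer OSD is left that could still hold the lost object. When that list is non-empty and an entry reports the OSD as down, starting the OSD again may let recovery find the object. A sketch, assuming the OSD runs under systemd (not from the original session):

# If might_have_unfound lists a down OSD, bringing it back may recover the object
systemctl start ceph-osd@14
# Then re-check the PG's recovery state
ceph pg 2.2b query | grep -A 5 might_have_unfound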
If repair cannot fix it, there are two ways out: revert the unfound object to an older version, or delete it outright. revert rolls the object back to its previous version (or forgets it entirely if it was a newly created object), while delete forgets the unfound object completely.
5. Solution
Revert to the old version:
[root@k8snode001 ~]# ceph pg 2.2b mark_unfound_lost revert
Delete outright:
[root@k8snode001 ~]# ceph pg 2.2b mark_unfound_lost delete
6. Verification
I went with delete here. The Ceph cluster then rebuilt the PG; after waiting a while and checking again, the PG state had changed to active+clean.
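While waiting for the PG to be recreated, the recovery progress can be watched in real time with standard commands (not from the original session):

# Stream cluster status changes as recovery proceeds
ceph -w
# Or poll the summary every 2 seconds
watch -n 2 'ceph -s'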
[root@k8snode001 ~]# ceph pg 2.2b query
{
    "state": "active+clean",
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "epoch": 11069,
    "up": [
        12,
        22,
        4
    ],
Check the cluster status again:
[root@k8snode001 ~]# ceph health detail
HEALTH_OK