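The company moved offices and every server's IP address changed. After configuring the new IPs on the Ceph servers and booting them, the monitor processes failed to start: each one kept trying to bind to its old IP address, which of course could not succeed. At first I suspected the servers' IP settings, but changing the hostname, editing ceph.conf and similar attempts got nowhere. Working through it step by step, I found that the monmap still held the old IPs. Ceph reads the monmap when it starts the monitor process, so the monmap itself has to be modified. The procedure: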
# Add the new monitor locations
# monmaptool --create --add mon0 192.168.32.2:6789 --add osd1 192.168.32.3:6789 \
  --add osd2 192.168.32.4:6789 --fsid 61a520db-317b-41f1-9752-30cedc5ffb9a \
  --clobber monmap
# Retrieve the monitor map
# ceph mon getmap -o monmap.bin
# Check new contents
# monmaptool --print monmap.bin
# Inject the monmap
# ceph-mon -i mon0 --inject-monmap monmap.bin
# ceph-mon -i osd1 --inject-monmap monmap.bin
# ceph-mon -i osd2 --inject-monmap monmap.bin
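`ceph mon getmap` needs a working quorum, so when no monitor can start at all, the Ceph documentation describes an offline variant: extract the monmap from a stopped monitor's own store, replace each entry with `monmaptool --rm` and `--add`, then inject the result into every monitor. Below is a minimal sketch of that variant using the monitor names and addresses from this post. The `update_monmap` and `run` helpers, the `DRY_RUN` switch and the `/tmp/monmap` path are my own assumptions for illustration. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: rewrite monitor addresses in a monmap while all monitors are stopped.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

# run CMD ARGS...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# update_monmap MAPFILE NAME=IP:PORT [NAME=IP:PORT ...]
# (hypothetical helper, not part of Ceph)
update_monmap() {
    map_path="$1"
    shift
    first_mon="${1%%=*}"
    # Pull the current map out of one stopped monitor's store.
    run ceph-mon -i "$first_mon" --extract-monmap "$map_path"
    # Replace every entry: remove the old address, add the new one.
    for entry in "$@"; do
        name="${entry%%=*}"
        addr="${entry#*=}"
        run monmaptool --rm "$name" "$map_path"
        run monmaptool --add "$name" "$addr" "$map_path"
    done
    # Check the result before injecting it.
    run monmaptool --print "$map_path"
    # Inject the edited map into each monitor's store.
    for entry in "$@"; do
        run ceph-mon -i "${entry%%=*}" --inject-monmap "$map_path"
    done
}

update_monmap /tmp/monmap \
    mon0=192.168.32.2:6789 osd1=192.168.32.3:6789 osd2=192.168.32.4:6789
```

Each `ceph-mon -i NAME` command has to run on the host that holds that monitor's data, so in practice the map file is copied to every monitor host before the inject step. Reviewing the dry-run output first is cheap insurance against injecting a wrong map.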
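After that the monitors started and everything was back to normal.
However, one OSD disk went down, the same failure described in the previous post. After searching around, all I found was a note on the Ceph website saying it is a Ceph bug. I had no way to repair it, so I removed that OSD and reinstalled it: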
# service ceph stop osd.4
# No need to run: ceph osd crush remove osd.4
# ceph auth del osd.4
# ceph osd rm 4
# umount /cephmp1
# mkfs.xfs -f /dev/sdc
# mount /dev/sdc /cephmp1
# Running "create" here does not install the OSD properly
# ceph-deploy osd prepare osd2:/cephmp1:/dev/sdf1
# ceph-deploy osd activate osd2:/cephmp1:/dev/sdf1
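For comparison, the manual removal sequence in the Ceph documentation marks the OSD out first and also removes it from the CRUSH map, two steps the commands above skip. The sketch below only prints that sequence for a given OSD id and executes nothing. The function name `print_osd_removal_steps` is hypothetical, and the `service ceph stop` form is taken from this post's sysvinit setup.

```shell
#!/bin/sh
# print_osd_removal_steps ID: print the removal sequence for osd.ID.
# Nothing is executed; the output is meant to be reviewed and run by hand.
print_osd_removal_steps() {
    id="$1"
    # Accept only a non-empty, purely numeric OSD id.
    case "$id" in
        ''|*[!0-9]*)
            echo "usage: print_osd_removal_steps <numeric osd id>" >&2
            return 1
            ;;
    esac
    cat <<EOF
ceph osd out $id
service ceph stop osd.$id
ceph osd crush remove osd.$id
ceph auth del osd.$id
ceph osd rm $id
EOF
}

print_osd_removal_steps 4
```

When the disk is still readable, the documentation suggests waiting after `ceph osd out` until the cluster has finished migrating data away before stopping the daemon. With a disk that has already failed, that wait may not be possible.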
Once that was done I restarted the OSD and it ran successfully. Ceph rebalances the data automatically; the final state was:
[root@osd2 ~]# ceph -s
    cluster 61a520db-317b-41f1-9752-30cedc5ffb9a
     health HEALTH_WARN 9 pgs incomplete; 9 pgs stuck inactive; 9 pgs stuck unclean; 3 requests are blocked > 32 sec
     monmap e3: 3 mons at {mon0=192.168.32.2:6789/0,osd1=192.168.32.3:6789/0,osd2=192.168.32.4:6789/0}, election epoch 76, quorum 0,1,2 mon0,osd1,osd2
     osdmap e689: 6 osds: 6 up, 6 in
      pgmap v189608: 704 pgs, 5 pools, 34983 MB data, 8966 objects
            69349 MB used, 11104 GB / 11172 GB avail
                 695 active+clean
                   9 incomplete
Nine PGs ended up in the incomplete state.
[root@osd2 ~]# ceph health detail
HEALTH_WARN 9 pgs incomplete; 9 pgs stuck inactive; 9 pgs stuck unclean; 3 requests are blocked > 32 sec; 1 osds have slow requests
pg 5.95 is stuck inactive for 838842.634721, current state incomplete, last acting [1,4]
pg 5.66 is stuck inactive since forever, current state incomplete, last acting [4,0]
pg 5.de is stuck inactive for 808270.105968, current state incomplete, last acting [0,4]
pg 5.f5 is stuck inactive for 496137.708887, current state incomplete, last acting [0,4]
pg 5.11 is stuck inactive since forever, current state incomplete, last acting [4,1]
pg 5.30 is stuck inactive for 507062.828403, current state incomplete, last acting [0,4]
pg 5.bc is stuck inactive since forever, current state incomplete, last acting [4,1]
pg 5.a7 is stuck inactive for 499713.993372, current state incomplete, last acting [1,4]
pg 5.22 is stuck inactive for 496125.831204, current state incomplete, last acting [0,4]
pg 5.95 is stuck unclean for 838842.634796, current state incomplete, last acting [1,4]
pg 5.66 is stuck unclean since forever, current state incomplete, last acting [4,0]
pg 5.de is stuck unclean for 808270.106039, current state incomplete, last acting [0,4]
pg 5.f5 is stuck unclean for 496137.708958, current state incomplete, last acting [0,4]
pg 5.11 is stuck unclean since forever, current state incomplete, last acting [4,1]
pg 5.30 is stuck unclean for 507062.828475, current state incomplete, last acting [0,4]
pg 5.bc is stuck unclean since forever, current state incomplete, last acting [4,1]
pg 5.a7 is stuck unclean for 499713.993443, current state incomplete, last acting [1,4]
pg 5.22 is stuck unclean for 496125.831274, current state incomplete, last acting [0,4]
pg 5.de is incomplete, acting [0,4]
pg 5.bc is incomplete, acting [4,1]
pg 5.a7 is incomplete, acting [1,4]
pg 5.95 is incomplete, acting [1,4]
pg 5.66 is incomplete, acting [4,0]
pg 5.30 is incomplete, acting [0,4]
pg 5.22 is incomplete, acting [0,4]
pg 5.11 is incomplete, acting [4,1]
pg 5.f5 is incomplete, acting [0,4]
2 ops are blocked > 8388.61 sec
1 ops are blocked > 4194.3 sec
2 ops are blocked > 8388.61 sec on osd.0
1 ops are blocked > 4194.3 sec on osd.0
1 osds have slow requests
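With this much output it helps to reduce it to the unique incomplete PG ids and to count how often each OSD shows up in their acting sets. The sketch below does that with `awk`, reading `ceph health detail` text on stdin. The function names are my own, the sample lines are copied from the output above, and the field positions assume this exact line format, which may differ in other Ceph releases.

```shell
#!/bin/sh
# list_incomplete_pgs: print each PG from "pg X is incomplete, acting [...]"
# lines exactly once. Reads `ceph health detail` text on stdin.
list_incomplete_pgs() {
    awk '$1 == "pg" && $3 == "is" && $4 == "incomplete," { print $2 }' | sort -u
}

# count_acting_osds: for the same lines, count how many incomplete PGs
# list each OSD id in their acting set.
count_acting_osds() {
    awk '$1 == "pg" && $3 == "is" && $4 == "incomplete," {
             set = substr($6, 2, length($6) - 2)   # "[0,4]" -> "0,4"
             n = split(set, ids, ",")
             for (i = 1; i <= n; i++) count[ids[i]]++
         }
         END { for (id in count) print "osd." id, count[id] }' | sort
}

# Sample input: a few lines copied from the health output in this post.
sample='pg 5.95 is stuck inactive for 838842.634721, current state incomplete, last acting [1,4]
pg 5.de is incomplete, acting [0,4]
pg 5.bc is incomplete, acting [4,1]
pg 5.a7 is incomplete, acting [1,4]
2 ops are blocked > 8388.61 sec on osd.0'

printf '%s\n' "$sample" | list_incomplete_pgs
printf '%s\n' "$sample" | count_acting_osds
```

On a live cluster the input would come from `ceph health detail | count_acting_osds`. In the full output above, osd.4 appears in the acting set of all nine incomplete PGs, which is consistent with it being the OSD that was wiped and recreated.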
Searching around turned up nothing. Here is what someone who ran into the same problem wrote:
I already tried "ceph pg repair 4.77", stop/start OSDs, "ceph osd lost", "ceph pg force_create_pg 4.77". Most scary thing is "force_create_pg" does not work. At least it should be a way to wipe out a incomplete PG without destroying a whole pool.
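I tried all of the methods above and none of them worked. For now I cannot solve this, and it feels like a bit of a trap.
PS: commonly used PG operations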
[root@osd2 ~]# ceph pg map 5.de
osdmap e689 pg 5.de (5.de) -> up [0,4] acting [0,4]
[root@osd2 ~]# ceph pg 5.de query
[root@osd2 ~]# ceph pg scrub 5.de
instructing pg 5.de on osd.0 to scrub
[root@osd2 ~]# ceph pg 5.de mark_unfound_lost revert
pg has no unfound objects
# ceph pg dump_stuck stale
# ceph pg dump_stuck inactive
# ceph pg dump_stuck unclean
[root@osd2 ~]# ceph osd lost 1
Error EPERM: are you SURE? this might mean real, permanent data loss. pass --yes-i-really-mean-it if you really do.
[root@osd2 ~]#
[root@osd2 ~]# ceph osd lost 4 --yes-i-really-mean-it
osd.4 is not down or doesn't exist
[root@osd2 ~]# service ceph stop osd.4
=== osd.4 ===
Stopping Ceph osd.4 on osd2...kill 22287...kill 22287...done
[root@osd2 ~]# ceph osd lost 4 --yes-i-really-mean-it
marked osd lost in epoch 690
[root@osd1 mnt]# ceph pg repair 5.de
instructing pg 5.de on osd.0 to repair
[root@osd1 mnt]# ceph pg repair 5.de
instructing pg 5.de on osd.0 to repair
When reposting, please credit: IT运维空间 » 运维技术 » Summary of the Ceph monitor failure and OSD disk crash caused by an IP change