IT Project Revelations: Lessons from the Titanic (Part 11)
2005/12/14 17:47:05 | Source: original
By Mark Kozak-Holland; Chinese translation by Yang Lei
To recap Titanic’s situation: following the collision (see Part 8), the ship was restarted and limped off the ice shelf with the objective of sailing back to Halifax. Everything appeared to be in good shape, but after 20 minutes of sailing at 8 knots it was apparent that the initial assessment was grossly inaccurate. The forward motion had taken its toll and the ship had taken on more water. Parts of the ship that were initially unaffected had started to spring leaks under the strain of the water, and the flooding was becoming catastrophic.
In today’s world, the top priority is to get service back online by applying a temporary fix while a permanent fix is created. However, in such a situation it is essential that the service delivery environment is closely monitored to see whether the fix is holding.
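As a rough illustration of "monitoring whether the fix is holding", the sketch below treats a fix as holding only when the most recent samples of some health metric all stay within a limit. The metric (an error rate), the threshold and the window size are assumptions made for this example and do not come from the article.

```python
def fix_is_holding(error_rates: list[float], threshold: float, window: int = 3) -> bool:
    """Return True if the most recent `window` samples are all at or below `threshold`.

    `error_rates` is a hypothetical series of post-fix measurements, oldest first.
    """
    if len(error_rates) < window:
        # Not enough evidence yet, so do not declare the fix a success.
        return False
    return all(rate <= threshold for rate in error_rates[-window:])
```

The point of the window is the one Titanic illustrates: a single calm reading shortly after the fix says little, so the check waits for several consecutive good samples before reporting success.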
The second search party, with the architect Thomas Andrews and the carpenter John Hutchinson, reported major flooding in five compartments and recognized that Titanic was not designed for this. The grinding along the bottom had badly ruptured the outer skin and damaged the double hull. The different rates of flooding in the six primary compartments indicated the top hull or tank top was damaged. It was beyond the expectations of the designer that something in nature could inflict so much damage.
In today’s IT projects, it is vital that the project team plans for the eventuality in which the fix does not resolve the problem and the situation goes beyond the Mean Time To Recovery (MTTR) for the IT solution (see Part 9). The service is unavailable to end-users and customers and is no longer readily recoverable. For this situation, disaster recovery procedures need to be set up, prepared, planned and tested in the project itself (see Part 4) and "institutionalized" with the staff (operations groups/technical support).
The architect realized the situation onboard Titanic had gone beyond normal problem recovery and had become a disaster. He stated that the ship had 2.5 to 3 hours before completely sinking, and accurately determined that the problem could not be fixed. Too many compartments were ruptured and were rapidly flooding beyond the capacity of all the pumps. The bulkhead walls, separating the compartments, had not been carried up to watertight horizontal traverses. Therefore, as the ship’s nose went down, water spilled from one compartment to another rather like an ice cube tray filling with water. The ballroom acted as a massive channel for distributing water horizontally across the ship.
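The escalation rule described above can be sketched in a few lines. This is only one possible reading, under the assumption that disaster recovery is invoked when the fix is not holding and the outage has already outlasted the MTTR target; the names `Action`, `next_action` and `mttr_target` are hypothetical.

```python
from datetime import datetime, timedelta
from enum import Enum


class Action(Enum):
    KEEP_MONITORING = "keep monitoring the temporary fix"
    INVOKE_DISASTER_RECOVERY = "invoke the disaster recovery plan"


def next_action(outage_start: datetime, now: datetime,
                mttr_target: timedelta, fix_is_holding: bool) -> Action:
    """Decide whether an incident has outgrown normal problem recovery.

    Assumed rule: escalate only when the fix is not holding AND the outage
    has lasted longer than the MTTR target for the solution.
    """
    elapsed = now - outage_start
    if not fix_is_holding and elapsed > mttr_target:
        return Action.INVOKE_DISASTER_RECOVERY
    return Action.KEEP_MONITORING
```

Writing the rule down in advance, in whatever form, is what makes it possible to test it in the project and hand it to operations staff, rather than leaving the decision to be improvised during the incident.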
At this point in the story we see how the compromises to the non-functional requirements during the construction phase (see Part 3) of the project had a massive consequence in the disaster.
Only the captain and a few officers knew the extent of the damage and were now resigned to the ship sinking. No "abandon ship" command or formal declaration of a disaster was given. Around 65 minutes after the collision the captain just gave orders to the officers to uncover the lifeboats and get the passengers and crew ready on deck. No formalized disaster recovery plan was in place on board Titanic.
In today’s world, the next step would be to invoke a disaster recovery plan and communicate it to all onboard. Every disaster recovery plan needs to be accompanied by a well-thought-out communication plan that addresses each of its different audiences clearly.
Titanic’s captain knew the seriousness of the situation relatively quickly after the collision, but did not communicate this through the ranks of crew and passengers on board. This increased the confusion, particularly among the crew. For example, the engine room sent some engineers to the boat deck, but the bridge sent them back down to the engine room. There are a number of possible explanations for the poor communication aboard Titanic:
·The ship had very limited communication, with no public-address systems. Important information was communicated to passengers by word of mouth, the crew knocking on each cabin door and common room. Considering there were hundreds of cabins, this could take hours.
·The crew didn’t have accurate information on the situation, so varying degrees of information were passed to passengers. The experienced captain believed in the safety systems of the ship and might have found the architect’s verdict very hard to accept because everything appeared so normal in the first hour. The captain acted almost as if the situation was "business as usual."
·The captain realized that the carrying capacity of the lifeboats was inadequate, with only enough room for about half of the estimated 2,223 people on board. He may have judged it better to keep things calm and allow the lifeboats to be filled in an orderly manner when the timing was right. The ship’s hierarchical structure and segregation of classes meant that first-class passengers had the best access to the boats.
·The captain feared widespread panic. He and the other officers were aware of the French liner La Bourgogne, which sank 14 years earlier. With room in the lifeboats for only half the people onboard, widespread panic had broken out. Captain Smith knew he could save the maximum number of lives by loading only those who were lucky enough to reach the boats. So, he may have avoided informing all the passengers, specifically in third class.
In today’s world a communication plan is probably as important as a disaster recovery plan, for several reasons:
·Communicating internally with your employees can greatly help control the impact of a disaster. Also, the speed of communication is essential. For example, get information to customer-facing employees first, so they can inform customers.
·Communicating externally with your customers is essential and the plan needs to cater to customer segments using different channels, depending on the scope of the problem or disaster. A customer-retention strategy might need to be offered.
·Communicating with the press may be necessary depending on how serious the loss of service is. This requires the identification of key messages, how these are communicated, and through what channels. Many companies have been caught off guard when roving reporters trap unaware employees with questions.
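One way to make such a communication plan concrete is to record it as data: who is told, through which channel, at what severity, and in what order. The sketch below is illustrative only; the severity levels, channel names and thresholds are assumptions made for the example, not recommendations from the article.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    MINOR = 1
    MAJOR = 2
    DISASTER = 3


@dataclass(frozen=True)
class Audience:
    name: str
    channel: str            # hypothetical channel label
    min_severity: Severity  # inform this audience at or above this level
    order: int              # lower numbers are informed first


# Illustrative plan only; audiences, channels and thresholds are assumptions.
PLAN = [
    Audience("customer-facing employees", "internal briefing", Severity.MINOR, 1),
    Audience("all employees", "internal bulletin", Severity.MINOR, 2),
    Audience("customers", "status page and account managers", Severity.MAJOR, 3),
    Audience("press", "prepared statement via spokesperson", Severity.DISASTER, 4),
]


def notification_sequence(severity: Severity) -> list[Audience]:
    """Return the audiences to inform, in order, for the given severity."""
    due = [a for a in PLAN if severity >= a.min_severity]
    return sorted(due, key=lambda a: a.order)
```

Calling `notification_sequence(Severity.MAJOR)` yields customer-facing employees first, then all employees, then customers, which mirrors the advice to brief the staff who talk to customers before anyone else.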
Conclusions
Today, many IT projects severely compromise an operation by not preparing for worst-case scenarios. MTTR procedures alone are not enough. Aside from a disaster recovery plan, a well-thought-out communication plan needs to be in place. The next installment will look at invoking disaster recovery.