Wednesday, November 6, 2013

A Tester's Perspective: What Went Wrong with Healthcare.gov?

Background

By now everybody has heard of the healthcare.gov website rollout. By all accounts this initial rollout has been a colossal failure. In this multi-part blog post we will examine some of the key issues that have caused this rollout to go awry.

In a software project of this size, it’s no surprise that there are issues with the initial rollout. However, the magnitude of these issues has surprised IT professionals and laypeople alike. To give some context, let’s look at some key findings about large-scale IT projects. In 2012, Standish Group International conducted a survey of large-scale IT projects. They found that fewer than 10% of projects valued over $10 million were completed successfully, meaning on time and within budget. To bring this a little closer to home, the survey found that 48% of all federal IT projects had to be re-baselined. In layman’s terms, this means the projects had to be restructured because of cost overruns and/or a change in project goals. Of the projects that had to be re-baselined, it is interesting to note that more than half had to be re-baselined more than once. This does not bode well for a federally run IT project. The Department of Health and Human Services’ history is even more troublesome. As of 2008, 43% of its projects were on the Office of Management and Budget’s “watchlist” because of poor performance and other management issues.

Before we can evaluate what went wrong with the website rollout, we have to evaluate what its purpose was. By most accounts, the purpose of the website is to provide a centralized location through which people seeking health insurance can secure it. By the Department of Health and Human Services’ estimates, the site is meant to serve 47 million uninsured customers; however, the department set a much lower goal of actually enrolling only 7 million by March 1, 2014.

First Reported Symptoms

Before we dive into specific problems, I do want to point out some notable headlines that made the news. Most people seem to assume that the problems with the website deal solely with scalability and performance; however, actual site traffic tells a different story. In the first week after the website went live, it had 9.47 million users. Of those who visited the site, many experienced page wait times of greater than 8 seconds, and less than 3% of them could actually create an account. A 2009 Aberdeen study found that 40% of users will abandon a website after waiting as little as 2 seconds. With an 8-second wait time and an inability to create an account, it is no surprise that the website is failing to meet expectations.

So the big question is, what went wrong with the site, and what could have been done to prevent it?
Before we can detail specific defects within the website, it is important to point out the general process flow that users are expected to follow when using the website. That flow is depicted below:



As you can see, users are expected to go to the healthcare.gov landing page. This is the page that really kicks off what the user can and should do next. Primarily, the user should create an account through the registration process. This includes setting a user ID and password and providing other pieces of personal information. It also includes an email verification step. From there, users go through a process to verify their identity, which involves calls to credit bureaus. From this point, eligibility for government subsidies is determined, which is largely a background operation, and then the user can request healthcare quotes. Once they receive quotes, they can choose coverage and complete the enrollment and payment processes. Once confirmed with the insurance company, they have coverage. Obviously, this is a simplified version of the overall process, but one that most people can understand and identify with. We will detail defects at various steps of this process in this and future blog posts.

The Landing Page


Performance


The most public of the issues with the healthcare.gov website have been performance related. As we have already pointed out, users were experiencing wait times of 8+ seconds. The government has publicly stated that the hardware was sized to handle between 50k and 60k concurrent users. The government has also stated that the code was based on the Medicare Part D rollout, which was designed to handle 30k concurrent users. Meanwhile, the government says the site actually experienced 250k concurrent users. These conflicting statements suggest that performance was a concern but was still treated as an afterthought in the software development process. As a side note, I should point out that when websites fail they tend to be capable of handling more concurrent users, simply because they are not doing the processing they should; in this case, returning a static error page is far easier on the servers than actually processing requests. In addition, it is doubtful that there was a standard definition of a concurrent user for this particular project. So it is no surprise that the government is reporting roughly five times as many users as expected.
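To put the quoted figures side by side, here is a rough sketch of the arithmetic. The capacity and peak numbers are the government's public statements as cited above. The arrival rate and session lengths in the second half are purely hypothetical, chosen only to show how the definition of a "concurrent user" (here, Little's law: arrivals per second times average session length) can move the number.

```python
# Figures quoted in the post (government statements).
SOFTWARE_DESIGN_USERS = 30_000
HARDWARE_DESIGN_USERS_LOW = 50_000
HARDWARE_DESIGN_USERS_HIGH = 60_000
REPORTED_PEAK_USERS = 250_000


def overload_factor(observed: int, capacity: int) -> float:
    """How many times larger the observed load is than the planned capacity."""
    return observed / capacity


def concurrent_users(arrivals_per_second: float, avg_session_seconds: float) -> float:
    """Little's law: average users in the system = arrival rate x time in system."""
    return arrivals_per_second * avg_session_seconds


if __name__ == "__main__":
    for label, capacity in [
        ("software design", SOFTWARE_DESIGN_USERS),
        ("hardware (low estimate)", HARDWARE_DESIGN_USERS_LOW),
        ("hardware (high estimate)", HARDWARE_DESIGN_USERS_HIGH),
    ]:
        factor = overload_factor(REPORTED_PEAK_USERS, capacity)
        print(f"{label}: {factor:.1f}x over capacity")

    # Hypothetical numbers, not from the post: the same arrival rate yields
    # very different concurrency once slow pages stretch each session.
    print(concurrent_users(arrivals_per_second=100, avg_session_seconds=600))
    print(concurrent_users(arrivals_per_second=100, avg_session_seconds=2500))
```

The "five times" figure only holds against the low hardware estimate. Against the 30k software design point, the reported peak is more than eight times the planned load.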

As a side note, and solely based on observation, certainly uncorroborated, it would appear the government implemented a waiting room. This waiting room served as a controlled access point to limit the number of concurrent users actually exercising the servers. With this waiting room, users were given no visibility into queue lengths or wait times. Often, users exited and re-entered the queue during the registration process. This may be one of the reasons for such a low registration rate.
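For illustration only, a waiting room that does expose queue position might look like the minimal sketch below. Nothing here reflects how healthcare.gov was actually built; the `WaitingRoom` class and its method names are hypothetical.

```python
from collections import deque


class WaitingRoom:
    """Hypothetical admission gate: at most max_active users reach the servers."""

    def __init__(self, max_active: int) -> None:
        self.max_active = max_active
        self.active: set[str] = set()
        self.queue: deque[str] = deque()

    def enter(self, user_id: str) -> int:
        """Admit the user if there is room, otherwise queue them.

        Returns 0 when admitted, else the user's 1-based queue position.
        """
        if user_id in self.active or user_id in self.queue:
            return self.position(user_id)
        if len(self.active) < self.max_active and not self.queue:
            self.active.add(user_id)
            return 0
        self.queue.append(user_id)
        return len(self.queue)

    def position(self, user_id: str) -> int:
        """0 for an admitted user, else the 1-based place in line."""
        if user_id in self.active:
            return 0
        return self.queue.index(user_id) + 1  # ValueError for unknown users

    def leave(self, user_id: str) -> None:
        """Remove the user and admit queued users into any freed slots."""
        self.active.discard(user_id)
        if user_id in self.queue:
            self.queue.remove(user_id)
        while self.queue and len(self.active) < self.max_active:
            self.active.add(self.queue.popleft())
```

Because `enter` returns the existing position for a user who is already queued, a user who refreshes the page keeps their place instead of going to the back of the line, which is the behavior the real waiting room appeared to lack.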

Performance Summary

  • 8+ second wait times
  • Timeouts, errors, etc.
  • Software designed for 30k concurrent users
  • Hardware designed for 60k concurrent users
  • “Waiting Room” had to be implemented

The impact

  • The inability to register an account
  • The inability to review coverages
  • The addition of people to process paper applications (8000 in the first week)
  • Leading story on all major news outlets
  • Trending topic on multiple social media outlets

Testing that should have been done


Obviously, with a site of this size and complexity, comprehensive performance testing needed to be done. However, it is apparent from the government’s own reports that adequate performance testing was never performed. In fact, the first end-to-end test occurred only one month prior to the go-live date. During this end-to-end test, they discovered over 200 additional defects and crashed the system multiple times.
It cannot be stated enough: testers need adequate test environments to perform testing. For performance testing, these environments either need to be comparable to production or need to be constructed in such a way as to be analogous to production. Not only that, environment stability is fundamental to executing a good performance test.
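As a rough illustration of what even a basic load test measures, the sketch below drives a number of simulated concurrent users against a request function and reports latency percentiles. It is a toy under stated assumptions: `fake_request` is a hypothetical stand-in for a real page request, and a real test would use a dedicated load tool against a production-like environment.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fake_request() -> None:
    """Hypothetical stand-in for one page request (assumed 10 ms of work)."""
    time.sleep(0.01)


def timed_call(request: Callable[[], None]) -> float:
    """Run one request and return its latency in seconds."""
    start = time.perf_counter()
    request()
    return time.perf_counter() - start


def run_load_test(
    request: Callable[[], None], concurrent_users: int, requests_per_user: int
) -> list[float]:
    """Issue requests from a pool of worker threads, one per simulated user."""
    total = concurrent_users * requests_per_user
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        return list(pool.map(lambda _: timed_call(request), range(total)))


def percentile(samples: list[float], pct: float) -> float:
    """Simple rank-based percentile over the sorted samples."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[index]


if __name__ == "__main__":
    latencies = run_load_test(fake_request, concurrent_users=20, requests_per_user=10)
    print(f"requests: {len(latencies)}")
    print(f"p50: {percentile(latencies, 50):.3f}s  p95: {percentile(latencies, 95):.3f}s")
```

Percentiles matter more than averages here: an 8-second wait for a slice of users can hide behind a healthy-looking mean.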

Security


Most of this section is attributable to the findings of Ben Simo, which can be found on his blog: http://blog.isthereaproblemhere.com/

Security of the healthcare website is of particular concern in the testing and IT security worlds; however, it is the most underreported failure of the website. Fortunately, the inability to create accounts is going to prove a blessing in disguise for the government. The sheer volume of security vulnerabilities in the website is truly astounding. In fact, there are so many critical security vulnerabilities that it is hard to label just one as the most critical. While I’m not a big fan of using “best practices”, there are certainly bad practices to avoid. The first is storing email addresses and passwords unencrypted within the header traffic of the website. The second is sending email addresses and password reset tokens to third parties. And finally, there are the 471 pieces of identifiable information passed back and forth between every webpage on the website, including birthdate, address, name, date of marriage, noncustodial parent information, absent parent information, etc.

Security Summary

These are just some of the most egregious security errors on the website; there are more. A summary of some of the critical security errors is listed below:
  • 471 identifiable pieces of information stored in the browser for every web page, including birth date, address, name, marriage date, non-custodial parent information, absent parent information, etc.
  • Sending email addresses and password reset codes to third parties
  • Using a single password reset code per account instead of a randomly generated code for each reset request
  • Security questions that an “ex” could answer, for example your favorite radio station
  • Displaying full stack trace errors to users
  • Displaying the username when an email address is provided as input
  • Displaying a user's security questions when a valid user ID is provided
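To make the reset-code item concrete, here is a minimal sketch of issuing a fresh random code for each reset request. It is an illustration, not a description of how the site worked or should be rebuilt: the `ResetTokenStore` class, its in-memory dictionary, and the 15-minute lifetime are all assumptions. The standard-library calls (`secrets.token_urlsafe`, `hashlib.sha256`, `hmac.compare_digest`) are real.

```python
import hashlib
import hmac
import secrets
import time


class ResetTokenStore:
    """Hypothetical in-memory store: one pending reset token per account."""

    def __init__(self, ttl_seconds: float = 900.0) -> None:
        self.ttl_seconds = ttl_seconds  # assumed 15-minute lifetime
        self._pending: dict[str, tuple[str, float]] = {}

    @staticmethod
    def _digest(token: str) -> str:
        # Store only a hash so a leaked store does not leak usable tokens.
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def issue(self, account_id: str) -> str:
        """Create a new random token; any earlier token for the account is replaced."""
        token = secrets.token_urlsafe(32)
        expires_at = time.time() + self.ttl_seconds
        self._pending[account_id] = (self._digest(token), expires_at)
        return token

    def redeem(self, account_id: str, token: str) -> bool:
        """Accept a token once, and only if it is the latest and unexpired."""
        entry = self._pending.get(account_id)
        if entry is None:
            return False
        digest, expires_at = entry
        if time.time() > expires_at:
            del self._pending[account_id]
            return False
        if hmac.compare_digest(self._digest(token), digest):
            del self._pending[account_id]  # single use
            return True
        return False
```

Each call to `issue` invalidates the previous code, so a code that leaked to a third party stops working as soon as the user requests another reset or redeems the current one.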

The impact

  • Fortunately, there have not been any disclosures of personally identifiable information.
  • There is a lack of security around personally identifiable information
  • Users' lack of confidence that the government is securing personal information
  • Inadvertent disclosure of access information to third parties; however, to date no known abuses have been reported

Testing that should have been done


At a minimum, basic security testing should have been performed. However, for a site with this much personally identifiable information, as is requested within the healthcare.gov website, more robust testing is called for and should have been done. This includes the typical white hat penetration type of testing plus normal security scans. Such testing should have targeted the following potential vulnerabilities:
  • Injection
  • Broken authentication and session management
  • Cross-Site Scripting
  • Insecure Direct Object References
  • Security Misconfiguration
  • Sensitive Data Exposure
  • Missing Function Level Access Control
  • Cross-Site Request Forgery
  • Using Components with Known Vulnerabilities
  • Unvalidated Redirects and Forwards

Summary

Clearly, the performance and security issues, on the landing page alone, paint a very scary picture. While I have not met or spoken to any of the testers involved in testing any of the main components of the website, it would appear that these groups either failed to perform adequate basic testing in these areas, were prevented from doing so by project and program level constraints, or raised these issues only to have them fall on deaf ears. As is often quoted, “if you have time to fix it in production, you had time to fix it in development.” I certainly hope that the powers that be take stock of the inadequacy of the website and either begin to perform more rigorous testing before any future rollouts or truly listen to the issues raised by internal teams. As a side note, there is a public debate going on about whether to close the website until it is fixed or allow it to remain up while work on the site continues. While I am not eager to engage in public debate around the politics of many of the decisions that went into rolling out the website, it is clear that there are enough vulnerabilities to warrant bringing the site down temporarily in order to fix some of the more critical issues and then bringing it back up.

In the next blog post we will examine some of the issues occurring at just the registration layer.


References used as the basis of this post:
*An anonymous user pointed out the spelling and grammar mistakes.  I apologize as I used Dragon Dictation to create this post - which was awesome - however, I posted the "pre"-proofread version instead of the corrected version.  Thanks for pointing it out.

3 comments:

Anonymous said...

Wow, it's ironic that a piece criticizing someone else's work contains so many basic grammar errors.

Joseph Ours said...

I've updated the post, thanks for pointing out I posted the wrong version. Sometimes we get in a hurry and make mistakes. I've been accused of not being human by my friends, LOL. Thanks for allowing me the opportunity to show that I am. :-)

Ed said...

Good analysis, Joe.

From a PM/BA's perspective, I heard there were some very big change requests that came down in the last two weeks. Probably no change management process in place for the proper impact analysis, code and test (including regression), etc. of these changes.

No matter what the expected traffic was, they should have phased in the system by establishing a schedule whereby only specific groups could log in the first week, the second week, etc. For example, those with SS numbers starting with 1 could log in the first week, those starting with 2 the second week, and so on. In that way, they could catch bugs sooner without being overwhelmed by capacity problems.

Ed Barkley