Background
By now everybody has heard of the healthcare.gov website
rollout. By all accounts this initial rollout has been a colossal failure. In
this multi-part blog post we will examine some of the key issues that have caused this rollout to go awry.
In a software project of this size it’s no surprise that there are issues with the initial rollout. However, the magnitude of these issues has been
quite surprising to IT professionals and the layperson alike. To give some
context, let’s look at some key findings of large-scale IT projects. In 2012,
Standish Group International conducted a survey of large-scale IT projects.
They found that fewer than 10% of projects valued over $10 million were
completed successfully, meaning on time and within budget. To bring this a
little closer to home, the survey found that 48% of all federal IT projects had
to be re-baselined. In layman’s terms this means that they had to restructure
the projects because of cost overruns and/or a change in project goals. Of the projects that had to be re-baselined, it is interesting to note that more than half had
to be re-baselined more than once. This does not bode well for a federally run IT project. The Department of Health and Human Services’ history is even
more troubling. As of 2008, 43% of its projects were on the Office of
Management and Budget’s “watchlist” because of poor performance and other
management issues.
Before we can evaluate what went wrong with the website rollout, we
have to evaluate what its purpose was. By most accounts the purpose of the
website is to provide a centralized location through which people seeking health
insurance can obtain it. By the Department of Health and Human
Services’ estimates, the site is meant to serve 47 million uninsured customers;
however, the department set a much lower goal of actually enrolling only 7 million by March 1, 2014.
First Reported Symptoms
Before we dive into specific problems, I do want to point
out some notable headlines that made the news. Most people seem to assume that
the problems with the website deal solely with scalability and performance;
however, actual site traffic tells a different story. In the first week after the
website went live, it had 9.47 million users. Of those who visited the site, many experienced page wait times of greater than 8 seconds, and fewer
than 3% could actually create an account. A 2009 Aberdeen study
found that 40% of users will abandon a website after waiting as little as
2 seconds. With 8-second wait times and an inability to create an account,
it is no surprise that the website is failing to meet expectations.
So the big question is, what went wrong with the site, and
what could have been done to prevent it?
Before we can detail specific defects within the website, it is
important to point out the general process flow that users are expected to
follow when using the website. That flow is depicted below:
As you can see, users are expected to go to the
healthcare.gov landing page. This is the page that really kicks off what the
user can and should do next. Primarily the user should create an account
through the registration process. This includes setting a user ID and password
and providing other pieces of personal information. It also includes an email verification step. From
there they go through a process to verify their identity which involves calls
to credit bureaus. From this point, eligibility for government subsidies is
determined, which is largely a background operation, and then the user can request
healthcare quotes. Once they receive quotes, they can then choose coverage and complete the enrollment and payment processes. Once confirmed with the insurance
company, they then have coverage. Obviously, this is a simplified version of the
overall process, but one that most people can understand and identify with. We
will detail some of the defects at various steps in this process in this and
future blog posts.
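To make the dependency between these steps explicit, here is a minimal sketch of the flow as an ordered list of steps. The step names and the helper functions are my own labels for the simplified process described above, not anything taken from the site's actual code. The point it illustrates is that a failure at registration blocks every step that follows it.

```python
from enum import Enum


class Step(Enum):
    """Simplified user flow; the names are illustrative labels, not site internals."""
    LANDING_PAGE = 1
    REGISTRATION = 2
    EMAIL_VERIFICATION = 3
    IDENTITY_VERIFICATION = 4
    SUBSIDY_ELIGIBILITY = 5
    QUOTES = 6
    ENROLLMENT_AND_PAYMENT = 7
    INSURER_CONFIRMATION = 8
    COVERED = 9


def next_step(current):
    """Return the step that follows `current`, or None at the end of the flow."""
    steps = list(Step)  # Enum members iterate in definition order
    index = steps.index(current)
    return steps[index + 1] if index + 1 < len(steps) else None


def steps_blocked_by(failed):
    """Return every step a user cannot reach if `failed` never completes."""
    steps = list(Step)
    return steps[steps.index(failed) + 1:]


if __name__ == "__main__":
    blocked = steps_blocked_by(Step.REGISTRATION)
    print("A registration failure blocks:", [step.name for step in blocked])
```

Because the flow is strictly sequential in this simplified view, a user who cannot create an account never exercises identity verification, quoting, or enrollment at all.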
The Landing Page
Performance
The most public of the issues with the healthcare.gov website
have been performance related. As we have already pointed out, users were
experiencing 8+ second wait times. The government has publicly stated that
the hardware was sized to handle between 50k and 60k concurrent users. The
government has also stated that the code was based on the Medicare Part D
rollout, which was designed to handle 30k concurrent users. Meanwhile, the
government says the site actually experienced 250k concurrent users. These
conflicting statements begin to imply that performance was a concern but was still
treated as an afterthought in the software development process. As a side note,
I should point out that when websites fail they tend to be capable of handling
more concurrent users, simply because they are not doing the processing they
should; in this case, returning a static error page is far easier on the servers
than actually processing requests. In addition, it is doubtful that there was a
standard definition of a concurrent user for this particular project. So it is
no surprise that the government is reporting roughly five times as many users as
expected.
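The arithmetic behind these figures can be sketched as follows. The capacity and traffic numbers are the ones quoted above. The arrival rate and the "active" windows in the second half are purely hypothetical values I picked to show how much a concurrent-user count depends on its definition, using Little's Law (average users in the system equals arrival rate times time in the system).

```python
def overload_factor(observed_users, designed_capacity):
    """Return how many times the observed load exceeds the designed capacity."""
    if designed_capacity <= 0:
        raise ValueError("designed_capacity must be positive")
    return observed_users / designed_capacity


def concurrent_users(arrivals_per_second, seconds_counted_as_active):
    """Little's Law: average users in the system = arrival rate * time in system."""
    return arrivals_per_second * seconds_counted_as_active


if __name__ == "__main__":
    observed = 250_000  # concurrent users the government says it experienced
    capacities = [
        ("software design", 30_000),
        ("hardware, low estimate", 50_000),
        ("hardware, high estimate", 60_000),
    ]
    for label, capacity in capacities:
        print(f"{label}: {overload_factor(observed, capacity):.1f}x over capacity")

    # Hypothetical: the same arrival rate yields very different "concurrent user"
    # counts depending on how long a visitor is counted as active.
    hypothetical_arrivals_per_second = 100.0
    for window_seconds in (60, 600, 1800):
        users = concurrent_users(hypothetical_arrivals_per_second, window_seconds)
        print(f"active window {window_seconds}s -> {users:,.0f} concurrent users")
```

Without an agreed definition of the active window, two teams can look at identical traffic and report concurrent-user counts that differ by an order of magnitude.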
As a side note, and based solely on my own uncorroborated observation, it would
appear the government implemented a waiting room. This waiting room served as a
controlled access point to limit the number of concurrent users actually
exercising the servers. In this waiting room, users were given no
visibility into queue lengths or wait times. Oftentimes, users exited and
re-entered the queue during the registration process. This may be one of the
reasons for such a low registration rate.
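For contrast, here is a minimal sketch of a waiting room that does give users visibility into their place in line. It is not how the site's waiting room worked (I have no knowledge of that); the class, its parameters, and the wait-time estimate are all assumptions made for illustration.

```python
from collections import deque


class WaitingRoom:
    """Admit at most `max_active` users; queue the rest in arrival order."""

    def __init__(self, max_active, avg_session_seconds):
        self.max_active = max_active
        self.avg_session_seconds = avg_session_seconds  # assumed average visit length
        self.active = set()
        self.queue = deque()

    def arrive(self, user_id):
        """Return True if the user is admitted immediately, False if queued."""
        if len(self.active) < self.max_active and not self.queue:
            self.active.add(user_id)
            return True
        self.queue.append(user_id)
        return False

    def leave(self, user_id):
        """Free the user's slot and admit queued users while capacity allows."""
        self.active.discard(user_id)
        while self.queue and len(self.active) < self.max_active:
            self.active.add(self.queue.popleft())

    def position(self, user_id):
        """1-based place in line; raises ValueError if the user is not queued."""
        return self.queue.index(user_id) + 1

    def estimated_wait_seconds(self, user_id):
        """Rough estimate: one slot frees up every avg_session / max_active seconds."""
        return self.position(user_id) * self.avg_session_seconds / self.max_active


if __name__ == "__main__":
    room = WaitingRoom(max_active=2, avg_session_seconds=600)
    for user in ("ann", "bob", "cal"):
        room.arrive(user)
    print("cal is number", room.position("cal"), "in line")
    print("estimated wait:", room.estimated_wait_seconds("cal"), "seconds")
```

Showing a position and even a rough estimate gives users a reason to stay in line rather than exit and re-enter, which only puts them at the back of the queue.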
Performance Summary
- 8+ second wait times
- Timeouts, errors, etc.
- Software designed for 30k concurrent users
- Hardware sized for 50k to 60k concurrent users
- “Waiting Room” had to be implemented
The impact
- The inability to register an account
- The inability to review coverages
- The addition of people to process paper applications (8000 in the first week)
- Leading story on all major news outlets
- Trending topic on multiple social media outlets
Testing that should have been done
Obviously, with a site of this size and complexity, comprehensive performance testing needed to be done. However, it is apparent
from the government’s own reports that adequate performance testing was never
performed. In fact, the first end-to-end test occurred only one month prior to the go-live
date. During this end-to-end test, they discovered over 200 additional defects
and crashed the system multiple times.
It cannot be stated enough: testers need adequate test
environments to perform testing. For performance testing, these environments
either need to be comparable to production or they need to be constructed in such a
way as to be analogous to production. Not only that, environment stability is fundamental
to executing a good performance test.
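As a rough illustration of what even a basic performance test measures, here is a small sketch that drives a stand-in handler with a fixed number of simulated concurrent users and reports the 95th percentile response time. The `fake_landing_page` function, the user counts, and the choice of percentile are all assumptions; a real test would target a production-like environment with a dedicated load testing tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def fake_landing_page():
    """Stand-in for a real request; sleeps briefly to simulate server work."""
    time.sleep(0.01)


def timed_call(handler):
    """Return how long one call to `handler` takes, in seconds."""
    start = time.perf_counter()
    handler()
    return time.perf_counter() - start


def run_load_test(handler, concurrent_users, requests_per_user):
    """Issue requests from `concurrent_users` worker threads; return all latencies."""
    total_requests = concurrent_users * requests_per_user
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        return list(pool.map(lambda _: timed_call(handler), range(total_requests)))


def p95(latencies):
    """95th percentile latency (needs at least two samples)."""
    return quantiles(latencies, n=100)[94]


if __name__ == "__main__":
    latencies = run_load_test(fake_landing_page, concurrent_users=20, requests_per_user=5)
    threshold_seconds = 2.0  # the abandonment threshold cited earlier in this post
    verdict = "PASS" if p95(latencies) <= threshold_seconds else "FAIL"
    print(f"{len(latencies)} requests, p95 = {p95(latencies):.3f}s -> {verdict}")
```

The useful part is the pass/fail threshold: a performance test without an agreed target simply produces numbers that nobody is obliged to act on.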
Security
Security of the healthcare website is of particular concern
in the testing and IT security worlds; however, it is the most underreported
failure of the website. Fortunately, the inability to create accounts is going
to prove a blessing in disguise for the government. The sheer volume of security
vulnerabilities in the website is truly astounding. In fact, there are so many
critical security vulnerabilities in the website that it is hard to label just one as
the most critical. While I’m not a big fan of using “best practices”, there are
certainly bad practices to avoid. The first is storing email addresses
and passwords unencrypted within the header traffic of the website. The second
is sending email addresses and password reset tokens to third
parties. And finally, there are the 471 pieces of identifiable information
that are passed back and forth between every webpage on the website, including
birthdate, address, name, date of marriage, noncustodial parent information,
absent parent information, etc.
Security Summary
These are just some of the most egregious security errors on
the website; however, there are more. A summary of some of the critical security
errors is listed below:
- 471 identifiable pieces of information stored in the browser for every web page including birth date, address, name, marriage date, non-custodial parent information, absent parent information, etc…
- Email addresses and password reset codes sent to third parties
- Using a single password reset code per account rather than a randomly generated code for each reset request
- Security questions that an “ex” could answer, for example, your favorite radio station
- Displaying full stack trace errors to users
- Displaying the username when an email address is entered
- Displaying a user’s security questions when a valid user ID is entered
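To illustrate the reset-code item in the list above, here is a minimal sketch of issuing a fresh random code for every reset request. The `ResetTokenStore` class, its in-memory storage, and the 15 minute lifetime are assumptions for illustration only; the one real API it leans on is Python's standard `secrets` module.

```python
import hashlib
import secrets
import time


class ResetTokenStore:
    """Issue a new random, single-use, expiring reset token for every request."""

    def __init__(self, ttl_seconds=900):
        self.ttl_seconds = ttl_seconds  # assumed lifetime: 15 minutes
        self._pending = {}  # user_id -> (token digest, expiry timestamp)

    @staticmethod
    def _digest(token):
        # Store only a hash, so a leaked table does not reveal usable tokens.
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def issue(self, user_id, now=None):
        """Create a fresh token; any earlier token for this user stops working."""
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)
        self._pending[user_id] = (self._digest(token), now + self.ttl_seconds)
        return token

    def redeem(self, user_id, token, now=None):
        """Return True once for a valid, unexpired token; False otherwise."""
        now = time.time() if now is None else now
        entry = self._pending.get(user_id)
        if entry is None:
            return False
        digest, expires_at = entry
        if now > expires_at:
            del self._pending[user_id]  # expired tokens are discarded
            return False
        if secrets.compare_digest(digest, self._digest(token)):
            del self._pending[user_id]  # single use: consumed on success
            return True
        return False


if __name__ == "__main__":
    store = ResetTokenStore()
    first = store.issue("user-1")
    second = store.issue("user-1")
    print("tokens differ:", first != second)
    print("old token still valid:", store.redeem("user-1", first))
    print("new token valid:", store.redeem("user-1", second))
```

With one code per request, a code that leaks to a third party is only useful until it expires or the user asks for another one, rather than for the life of the account.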
The impact
- Fortunately, there have not been any known disclosures of personally identifiable information
- There is a lack of security around personally identifiable information
- Users’ lack of confidence that the government is securing personal information
- Inadvertent disclosure of access information to third parties; however, to date no known abuses have been reported
Testing that should have been done
At a minimum, basic security testing should have been
performed. However, for a site that requests as much personally identifiable information as the healthcare.gov website does, more robust testing is called
for and should have been done. This includes the typical white hat penetration
type of testing plus normal security scans. Such
testing should have targeted the following potential vulnerabilities:
- Injection
- Broken authentication and session management
- Cross-Site Scripting
- Insecure Direct Object References
- Security Misconfiguration
- Sensitive Data Exposure
- Missing Function Level Access Control
- Cross-Site Request Forgery
- Using Components with Known Vulnerabilities
- Unvalidated Redirects and Forwards
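As one concrete example of what a test for the first item, injection, looks like, here is a small self-contained sketch using an in-memory SQLite database. The table, the two lookup functions, and the payload are hypothetical and say nothing about how the site's own queries are written; the sketch only shows the class of defect a basic security test is meant to catch.

```python
import sqlite3


def find_user_unsafe(conn, email):
    """Vulnerable: user input is concatenated straight into the SQL text."""
    query = "SELECT id FROM users WHERE email = '" + email + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, email):
    """Parameterized: the driver treats the input strictly as data."""
    return conn.execute("SELECT id FROM users WHERE email = ?", (email,)).fetchall()


def make_test_db():
    """Build a throwaway in-memory database with two hypothetical users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany(
        "INSERT INTO users (email) VALUES (?)",
        [("a@example.com",), ("b@example.com",)],
    )
    return conn


if __name__ == "__main__":
    conn = make_test_db()
    payload = "nobody@example.com' OR '1'='1"  # classic injection probe
    print("unsafe lookup rows:", len(find_user_unsafe(conn, payload)))
    print("safe lookup rows:", len(find_user_safe(conn, payload)))
```

A security test suite runs probes like this against every input field; the unsafe lookup returns every row in the table, while the parameterized one returns nothing.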
Summary
Clearly, the performance and security issues, on the
landing page alone, paint a very scary picture. While I have not met or spoken
to any of the testers involved in testing any of the main components of the
website, it would appear that these groups either failed to perform adequate
basic testing in these areas, were prevented from doing so by project and
program level constraints, or raised these issues only to have them fall on deaf ears.
As is often quoted, “if you have time to fix it in production, you had time to
fix it in development.” I certainly hope that the powers that be take stock of
the inadequacy of the website and either begin to perform more rigorous
testing of the website before any future rollouts or truly listen to the issues
raised by internal teams. As a side note, there is a public debate going on about
whether to close the website until it is fixed or allow it to remain up while they
continue to work on the site. While I am not eager to engage in public debate
around the politics of many of the decisions that went into rolling out the
website, it is clear that there are enough vulnerabilities that may warrant
bringing the site down temporarily in order to fix some of the more
critical issues and then bringing the site back up.
In the next blog post we will examine some of the issues
occurring at just the registration layer.
References used as the basis of this post:
*An anonymous user pointed out the spelling and grammar mistakes. I apologize as I used Dragon Dictation to create this post - which was awesome - however, I posted the "pre"-proofread version instead of the corrected version. Thanks for pointing it out.