The more time I’ve spent user testing, the more I’ve realized that determining “truth” about an interface is only one of its functions. User testing can indeed provide valuable quantitative findings from very small samples (contra those that describe it as purely “qualitative”) – yet generating empathy is another key function of user testing.
Each testing session is a sort of kabuki theatre, with facilitator and participants donning masks and performing roles for an audience. Like all good theatre, an aim of such a performance should be to remove the audience from its comfortable bubble and help it come to understand the feelings, frustrations and needs of others. A good user testing session should achieve this just as much as it provides hard data. I’ll outline my thoughts about this in the rest of this post.
There’s no substitute for real user data
UX designers create with a user in mind. Even when the designer may be representative of one of the user groups that a product targets, the breadth of most target markets is such that the designers will never be representative of all of the users. We therefore need to ensure two things.
First, we must ensure that designers have insight into these users – what they like and dislike, what they need and what they do; what they find easy and what they find difficult. Primary research can only provide so much of this insight, and as I will argue, the insights it provides can become lost as a project progresses. We need some way of preserving the empathy a designer should feel for the user, and user research (and particularly user testing) can provide this.
Second, there is of course a need to optimise hygiene factors in our work, such as usability. It doesn’t matter how epochal the concept is if users have difficulty signing up for a service, or if the copy is confusing. Every user journey is a series of vignettes, and the experience of each of these interactions must be optimised if the entire narrative is to be compelling. Whilst we should be careful to avoid merely listening to users and their preferences (lest we miss any unmet, unstated needs and so fail to innovate), we should be certain they can use and understand our designs as far as possible.
On large design projects, especially highly waterfall ones, user testing can get lost in the shuffle, or be added as a token review at the end of the project. Such projects typically proceed this way: when a creative project commences (or perhaps even at the pitch phase), background research will be conducted from which personas may be created. The background research could be primary – interviews or contextual work with target users – but may even just be secondary research such as quick desk research performed by the UX designer.
For most UX designers working on that project, this initial research is the most contact they have with real users. Gathered insights are transmitted in a number of forms, such as mental models, experience principles or personas. Such documents have their place, but their utility in helping designers understand how customers will approach and respond to a certain interface is limited. In particular, personas are most often used as a stand-in for the “customer” as a project progresses. It could be said that personas help to generate empathy with real users, but they have questionable value when trying to understand customers.
Personas do not represent real users; they are chimeras, pieced together from bits and pieces of research. Statistical analysis of typical personas shows that they commit a fallacy of conjunction: as we make our personas more specific they may feel more descriptive, but actually they represent fewer and fewer users (Chapman et al, 2008). This is due to what Tversky and Kahneman (1974) termed the representativeness heuristic – what feels representative is not necessarily probable. Moreover, as Matthews et al (2012) discovered, few designers use personas for anything more than communication, instead turning to raw user data to inform their designs. As they note:
However, our results emphasize the critical importance of immersion in actual user data or, if time permits, exposure to users. User data is crucial in enabling designers to appropriately interpret personas or other abstractions. (Matthews et al, 2012)
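The conjunction problem that Chapman et al describe comes down to simple multiplication. The sketch below is only an illustration of that arithmetic, not their analysis: the attribute names and prevalence figures are invented, and it assumes the attributes are statistically independent, which real user attributes rarely are.

```python
from itertools import accumulate
from operator import mul

# Hypothetical persona attributes with made-up prevalence in the target market.
PERSONA_ATTRIBUTES = [
    ("aged 25-34", 0.30),
    ("lives in a city", 0.60),
    ("shops online weekly", 0.40),
    ("owns a tablet", 0.50),
]


def share_matching(attributes):
    """Return the share of users matching the first 1, 2, ... attributes.

    Assumes the attributes are independent, so the shares simply multiply.
    """
    return list(accumulate((share for _, share in attributes), mul))


if __name__ == "__main__":
    shares = share_matching(PERSONA_ATTRIBUTES)
    for (name, _), share in zip(PERSONA_ATTRIBUTES, shares):
        print(f"... and {name}: {share:.1%} of users")
```

Each added detail makes the persona feel more vivid, yet under these assumed figures the fully specified persona describes fewer than one user in twenty.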
Personas are used to inform design when we miss the purpose of user research activities (including user testing). The point is not to create attractive deliverables, nor is it to set up highly scientific studies – the point is to design something better, and exposure to customers is the best way to achieve this. As Microsoft’s Dennis Wixon puts it:
Unfortunately when considered from the viewpoint of designing real products in the real world [the number of participants needed in a user test] is the WRONG PROBLEM TO STUDY. The goal of most iterative tests in a commercial world is to produce the best possible product/design in the shortest time at the lowest possible cost with the least risk. (Bevan et al, 2003)
As Wixon notes, the same holds true for user testing: even if we feel we must expose UX designers to real users – and user testing is an excellent way to do so – we shouldn’t lose sight of why we are conducting these tests. We test to improve the user experience of our designs.
Summative testing is not enough
It seems designers need some exposure to users – and that personas are a poor way to do this. How should this be provided? A reliance on initial primary (or even, secondary) research doesn’t really help UX designers “get inside the heads” of customers. User testing can help them to understand user behaviour, but only if we conduct it with the right goals in mind. Most user testing involves having a representative sample of customers “think aloud” or explain their thought processes verbally whilst completing a set of defined tasks with an interface. As per Nielsen (1994), five participants is usually regarded as being enough to detect at least 80% of usability problems with an interface, though user tests following this “discount testing” approach will need to recruit more participants if they want to compare results between interface variants or detect the absolute rate at which issues occur.
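The arithmetic behind the “five participants” figure is worth seeing. This is a minimal sketch, assuming the simple model usually associated with it: each participant independently has the same probability p of encountering any given problem, so the expected share of problems seen at least once by n participants is 1 - (1 - p)^n. The value p = 0.31 used below is the average often quoted alongside Nielsen’s work; treat it as an assumption rather than a property of your own interface.

```python
def share_of_problems_found(p: float, n: int) -> float:
    """Expected share of problems seen at least once by n participants.

    Assumes every problem affects each participant independently with
    the same probability p (a strong simplification).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must not be negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Compare an "easy to hit" problem rate with a rarer one.
    for p in (0.31, 0.10):
        for n in (1, 3, 5, 10, 15):
            print(f"p={p:.2f}, n={n:2d}: {share_of_problems_found(p, n):.0%}")
```

With p = 0.31, five participants are expected to surface roughly 84% of problems; with p = 0.10, the same five surface only about 41%.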
Despite the advice being to test early, these studies will usually be conducted at the end of a design process. This summative testing reviews the near-complete design and identifies usability problems in the design, with quick fixes being made at the end of the development cycle and major changes being shifted to any future releases. These findings are typically relayed in a detailed PowerPoint report.
Such research is often conducted by external agencies at a business’s behest; there is a sense that the designers’ homework is being marked to ensure the design meets some user experience KPIs.
All in all, this is not an effective way to conduct user testing – neither for generating empathy nor for yielding robust data. The old “5 participants is enough” dictum is a simplistic bromide used to sell discount user testing. The number of participants needed to conclusively “uncover” the majority of usability issues depends upon factors such as the probability of a participant experiencing an issue, individual differences between participants, task complexity, the interface being tested and the probability that the usability expert(s) will detect the issue (Woolrych and Cockton, 2001). Five participants would not be enough to test the whole of Amazon, due to its sheer size; 5 participants would not be enough to test Dropbox, due to the heterogeneity of its uses.
Helpful sample size calculators are available to lessen some of these issues, such as Jeff Sauro’s. These mitigate some of the uncertainty about testing sample size, but the fundamental issue remains – if you don’t know how often a problem occurs, how do you know you have enough users to uncover all critical problems?
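To make the dependence on problem occurrence concrete, here is a hedged sketch of the kind of calculation involved. It assumes each participant independently encounters a problem with probability p, so that n participants see it at least once with probability 1 - (1 - p)^n, and solves for n. It is not a reimplementation of Sauro’s calculator, and the values of p are illustrative only.

```python
import math


def participants_needed(p: float, target: float = 0.8) -> int:
    """Smallest n for which 1 - (1 - p)**n reaches the target share.

    p is the assumed per-participant probability of hitting a problem,
    which in practice is exactly the number we rarely know in advance.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    n = math.ceil(math.log(1.0 - target) / math.log(1.0 - p))
    return max(1, n)


if __name__ == "__main__":
    for p in (0.31, 0.10, 0.05):
        print(f"p={p:.2f}: {participants_needed(p)} participants for 80%")
```

Under this model a problem that hits 31% of users needs 5 participants for an 80% chance of being seen, one that hits 10% needs 16, and one that hits 5% needs 32. The answer is only as good as the guess at p.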
We could respond to this by demanding user testing only be conducted if it has scientific rigour and enough participants to uncover every issue possible. We could overshoot, doing away with discount testing and always testing with a large sample of users. I think this is again to miss the point of user research. As Wixon correctly observed, the point of testing is to improve a design within a set of temporal and financial constraints. To do this, we need to provide contact between designers and customer behaviour in order to generate design insight. By design insight, I mean an understanding by designers of the strengths and weaknesses of the designs they have created and some understanding of how they could be improved. This is what I mean by empathy – the ability to understand a designed object from another perspective, that of the users, and see the properties of that design through their eyes. The theatre of user testing delivers this.
I could test 10 participants and, if my methodology is incorrect, generate very little design insight. Providing ease of use ratings out of 5 for each feature might, for instance, help meet client KPIs but does not help a designer understand why a design failed nor how it could be improved. Observing just one participant, on the other hand, might achieve this. Whilst our user testing should have some internal rigour, our focus should be on understanding how to improve our designs, which we can do even without being (quasi-) experimental in our methods. We should not fixate on the number of participants that we test, but rather on what utility we can get from each participant that we test.
Of course, the inverse is also true, and we need to avoid sloppy research under the auspices of generating empathy. We shouldn’t be afraid of numbers informing our design, and gathering certain key metrics (such as task success/failure and task times) will likely help to improve the design. However, a dry report filled with numbers is unlikely to create insight into how the design can be improved – whilst the visceral feeling of observing actual users can drive this empathy and creativity. Ideally, we want a cost-effective approach that combines the two.
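Small samples can still yield usable numbers, provided the uncertainty is reported alongside them. As a hedged illustration, the sketch below computes a Wilson score interval for a task success rate. The choice of the Wilson interval and the 4-out-of-5 example are assumptions made for illustration, not something the sources cited here prescribe.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    low, high = wilson_interval(successes=4, n=5)
    print(f"4 of 5 succeeded: plausible success rate {low:.0%} to {high:.0%}")
```

Four successes out of five gives an interval of roughly 38% to 96%: wide, but enough to say that the task is unlikely to be failing for most users.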
The consequence of this is that it’s better to provide timely design insights – even if the tests use a smaller sample or are more ad hoc – than it is to test at the end of a design process. The inimitable Bill Buxton considers design as a funnel (Buxton, 2007), as in Fig. 1 below.
Fig. 1 Buxton’s Design Funnel (From Buxton, 2007)
For Buxton, designers initially attempt to get the right design whilst ideating, before getting that design right by improving its usability, all through a series of iterations. Summative user testing happens at the very end of this process. At best, such testing allows designers to fix some of the usability issues with their design, though due to time and budget constraints it is unlikely most identified issues will be fixed.
What UX designers need are insights from testing early and often during this formative stage of the design process. Early to ensure issues with the overarching concept, as well as major usability issues, are caught soon, preventing the designers taking an approach that simply doesn’t work. Issues can then be fixed within time and on budget. It’s when we follow the wrong path until it is too late to turn back that our work suffers; the sooner we are nudged in the right direction, the better. Testing should also happen often to ensure that each iteration of the design is informed by insights that come from real customers. The best design is brisk, iterative and avoids superfluous documentation; the best user testing should mirror this. The key is in the iteration, through which we can ensure both empathy and some statistical rigour.
The RITE way to test
There are times when it is best for external teams or agencies to user test a design – the business may want final, external validation of that design, or perhaps the resources simply aren’t there for the design team to conduct some testing themselves. I won’t therefore make recommendations for conducting that, nor for running background research before the design work commences, since the correct approach here will vary massively between projects. What I do want to recommend is how to conduct formative user testing during the design phase of a project. There’s a simple, cheap, iterative user testing method that can be appropriated to greatly improve the quality of UX work.
An effective form of testing can be based on an approach from Medlock et al (2002) called Rapid Iterative Testing and Evaluation or RITE. This approach to testing involves conducting standard think-aloud testing with a user but then rapidly fixing identified issues – potentially even after testing with only one user. This means that the design is iterated and then retested through a number of cycles. Each main interface feature is continually iterated and retested until no more participants have a significant problem with that feature. This approach allows us to test each feature with a surprisingly large number of participants throughout the design cycle, by only testing it with a few at a time until no major issues remain. It also improves the impact ratio of the testing – more identified issues can be fixed. Moreover, it gives power to the UX designers to conduct and interpret research findings, and determine if an issue experienced by a few users calls for modification.
UX designers should test ~3 customers every couple of weeks throughout the project. Major features are tested each time and improved in the intervening fortnight. The number of iterations may need to be limited due to time/budget constraints and thus may not be able to completely solve all design problems identified. That aside, testing and tweaking a feature should continue until the designers are happy it has been optimised; such an approach also allows them to track how the designs are improving throughout the design process. The other key to this approach would be to keep it lean, relaying findings via email and standups as opposed to clunky and unenlightening findings reports.
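The bookkeeping this implies is light. The sketch below is one hypothetical way to track it and is not part of Medlock et al’s method: the `Round` record, the example data and the stopping rule (stop once the most recent round produced no significant problems) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Round:
    """One test round for a single feature (hypothetical record)."""
    participants: int        # customers who tried the feature this round
    with_major_problem: int  # how many of them hit a significant problem


def total_participants(rounds: List[Round]) -> int:
    """Cumulative number of customers who have tested the feature."""
    return sum(r.participants for r in rounds)


def needs_another_iteration(rounds: List[Round]) -> bool:
    """True until the latest round passes with no significant problems."""
    if not rounds:
        return True
    return rounds[-1].with_major_problem > 0


if __name__ == "__main__":
    # Invented data: three fortnightly rounds of three customers each.
    sign_up = [Round(3, 2), Round(3, 1), Round(3, 0)]
    print(total_participants(sign_up), needs_another_iteration(sign_up))
```

In this invented example the sign-up feature is seen by nine customers in total, even though no single round involved more than three.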
Crucially, each feature would ultimately be tested with a good number of customers (potentially more than the “minimum” of 5) across iterations, and the team would be able to fix far more of the identified issues. Indeed, this would be closer to Nielsen’s (1994) original vision for testing. As Jeff Sauro notes:
The best strategy is to bring in some set of users, find the problems they have, fix those problems, then bring in another set of users as part of an iterative design and test strategy. In the end, although you’re never testing more than 5 users at a time, in total you might test 15 or 20 users. In fact, this is what Nielsen recommends in his article, not just testing 5 users in total.
However, the ultimate impact on project timelines and budget (a few hours of internal UX designer time every few weeks) would be less than that of typical end-of-cycle testing (where the cost would be for a week of an external agency’s time). Despite this, the approach still prioritises first-hand experience of a customer’s behaviour as a better source of design insight than simply reading a report. This crucially requires that the UX designers – or at least closely affiliated internal researchers – conduct the testing, ensuring first-hand experience with users.
The sessions shouldn’t just include task-based testing, however. Ideally, this testing should be bookended by opening and closing interviews with the user. Such interviews don’t merely warm up or debrief the user, but are a crucial engine for design insight, helping designers better understand the point of view, prejudices and daily routine of the user. Such interviews are anecdotal, so shouldn’t be considered “truth” in the same way that the testing perhaps can, but they can help the designer to escape their conceptual bubble. Note, however, that the interviews aren’t the only source of empathy or design insight – observation of users performing tasks with an interface is just as crucial a source.
A few more considerations about this approach
One weakness of this approach is that it prima facie lacks context or narrative – the context of use could be in-store, or in the street – how do we approximate this? First, testing in a lab or office may have little impact on the number of usability issues that testing uncovers versus testing in the field (Kaikkonen et al, 2005). Second, and though this will require more planning, this approach certainly could work in the user’s context of use. For instance, the designers could arrange to test a prototype of an in-store app in-store. Third, they can simulate at least part of the customer journey in the sessions, be it using printed materials to simulate ads or testing both desktop and mobile interfaces in the same session. It will be possible to run each study to accommodate at least some of the non-linearity that typifies customer experience today.
It is also key that this testing is conducted for the benefit of the design work and is baked into design practice. Whilst client or business stakeholders should of course be welcome to view the sessions, the ultimate purpose is to provide contact between UX designers and users. However, the data from the testing should help to inform the work – it should not completely drive it. There will be circumstances where UX designers may wish to disregard a user testing finding, unless a great many participants experience a particular usability issue. This is a critical point – user testing is a tool in design practice. Users may poorly understand novel or transcendent ideas, at least at first, and designers may need to be careful not to hobble their ability to innovate. The important point is to give the UX team control over these choices – and a measure of trust – and allow them to run this research approach themselves. My experience is that good UX designers aren’t half as wedded to particular design concepts as is often claimed, and are instead wedded to creating a good user experience.
Jon Innes correctly argues that user testing is quantitative, and that it’s a fear of numbers that leads many to see it as purely qualitative. Yet user testing is not purely quantitative, and it would be an impoverished view of the method to ignore the strong qualitative insights – as well as the even more abstract, emotional responses – that testing can deliver. An iterative testing approach such as RITE provides contact between users and designers throughout the design cycle. This delivers insights when the UX team needs them, empowering them to run and interpret their own formative research. Ultimately, however, it still involves testing with a statistically robust sample across all of the design iterations, meaning good quantitative data can also be supplied. Finally, such an approach can also be implemented in a lean way, dispensing with the lumpen reports that typically follow tests. In such a way, testing can provide both truth and empathy.
Even Apple, which famously closed its research activities when Steve Jobs returned, did ensure regular contact between designers and the target customer; the target customer was always Jobs himself. Note that this approach may now be more difficult to maintain.
This isn’t to say that personas have no uses. They are still useful communication tools (especially if clients expect or demand them), and the process of creating the personas, along with the immersion in consumer data that this entails, is certainly an effective use for personas. The problems arise when we ask designers to engage with them instead of real customers or with real customer data (which should be both quant and qual).
Bevan, N., Barnum, C., Cockton, G., Nielsen, J., Spool, J., Wixon, D. (2003). The “Magic Number 5”: Is It Enough for Web Testing? In: CHI Extended Abstracts, pp. 698–699. ACM Press, New York.
Buxton, B. (2007). Sketching user experiences. San Francisco, CA: Elsevier.
Chapman, C.N., Love, E., Milham, R.P., ElRif, P. & Alford, J.L. (2008). Quantitative evaluation of personas as information. In Proceedings of the Human Factors and Ergonomics Society 52nd Annual Meeting, pp. 1107–1111.
Kaikkonen, A., Kekäläinen, A., Cankar, M., Kallio, T., and Kankainen, A. (2005). Usability testing of mobile applications: a comparison between laboratory and field testing. Journal of Usability Studies, Vol. 1, 1, pp. 4-16.
Matthews, T., Judge, T., Whittaker, S. (2012). How Do Designers and User Experience Professionals Actually Perceive and Use Personas? In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems 2012, pp. 1219-1228. ACM, New York, NY.
Medlock, M.C., Wixon, D., Terrano, M., Romero, R., Fulton, B. (2002). Using the RITE method to improve products: a definition and a case study. Proc. Usability Professionals Association (Orlando FL, July 2002).
Nielsen, J. (1994). Estimating the number of subjects needed for a thinking aloud test. International Journal of Human-Computer Studies, Vol. 41, 3, 385-397.
Tversky, A. & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science. 185(4157). 1124-1131.
Woolrych, A., & Cockton, G. (2001). Why and when five test users aren’t enough. In J. Vanderdonckt, A. Blandford, & A. Derycke (Eds.), Proceedings of IHM-HCI 2001 Conference: Vol. 2, pp. 105-108. Toulouse, France: Cépaduès.