Sunday, November 15, 2009

Indirect Touch is Better than Direct Touch (sometimes)

In the broadest sense, touch as a technology refers to an input device that responds without requiring the user to depress an actuator. The advent of technologies that allow displays and input devices to overlap has given rise to direct touch input. With direct touch, we apply the effect of a touch to the area of the screen beneath the finger(s). This gives rise to claims of ‘more natural’ computing, and enables certain classes of devices, such as mobile devices like the Zune or a Windows Mobile phone, to be better. Direct touch, however, introduces a host of problems that make it ill-suited to most applications.

What’s With Direct Touch?

I recently attended a presentation in which the speaker described a brief history of Apple’s interaction design. The speaker noted that the iPhone’s touch interface was developed in response to the need for better input on the phone.

Phooey.

The iPhone’s direct-touch input was born of a desire to make the screen as large as possible. By eliminating a dedicated area for input, Apple could cover the whole of the front of the device with a big, beautiful screen (don’t believe me? Watch Jobs’ keynote again). The need to constantly muck up your screen with fingerprints is a consequence of the desire to have a large display – not a primary motivator. Once we understand this, we understand the single most important element of direct touch: it allows portable devices to have larger screens. But it also brings a litany of negative points:

  • Tiring for large and vertical screens (the Gorilla Arm effect)
  • Requires users to occlude their content in order to interact with it (leading to the ‘fat finger’ problem)
  • Does not scale up well: human reach is limited
  • Does not scale down well: human muscle precision is limited

There is also some evidence that direct touch allows children to use computers at a younger age, and that older novice computer users can learn it more quickly. Direct touch has also demonstrated utility in kiosk applications, characterized by short interaction periods and novice users. We on the Surface team also believe that direct touch is suitable for certain scenarios involving sustained use. Clearly, though, direct touch is far from being the be-all and end-all input technique – it has a whole host of problems that make it ill-suited to a number of scenarios.

What’s with Indirect Touch?

Indirect touch overcomes most of the negatives of direct touch.

  • Scalable both up and down via C/D (control-display) gain (aka ‘mouse speed’ in Windows – ‘touch speed’ in Windows 8?)
  • Enables engaging interactions in the same way direct touch does
  • Fingers do not occlude the display during use
  • Not tiring if done correctly: with the right scale factor (C/D gain), a large area can be affected with small movements (see the sketch after this list)
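To make the C/D gain point concrete, here is a minimal Python sketch of how an indirect device can scale (or shrink) finger motion before applying it to the display. The function and numbers are hypothetical; real drivers use non-linear acceleration curves rather than a single constant.

```python
# Minimal sketch of control-display (C/D) gain for an indirect touch device.
# Names and numbers here are hypothetical; real systems apply non-linear
# "pointer acceleration" curves rather than one constant gain value.

def apply_cd_gain(dx_pad, dy_pad, gain=3.0):
    """Scale a finger movement on the pad (in mm) into display movement (in mm)."""
    return dx_pad * gain, dy_pad * gain

# Example: the same 20 mm finger movement sweeps 60 mm of display with
# gain=3.0 (large displays, small effort), or 10 mm with gain=0.5
# (precision finer than the finger itself).
if __name__ == "__main__":
    print(apply_cd_gain(20, 0, gain=3.0))   # (60.0, 0.0)
    print(apply_cd_gain(20, 0, gain=0.5))   # (10.0, 0.0)
```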

Revolutionary Indirect Touch

Indirect touch enables a host of scenarios not possible with direct touch, many of which are much more compelling. Our first example <REDACTED FROM PUBLIC VERSION OF THE BLOG. MSFT EMPLOYEES CAN VIEW THE INTERNAL VERSION>

While exciting and low-tech, this demo is ‘single touch’ (not multi-touch). So, what would indirect multi-touch look like? Let’s take a look at four examples. The first is again from a teaming of MS Hardware and MSR: Mouse 2.0, a mouse augmented with multi-touch sensing that enables multi-touch gestures without leaving the mouse. Right around the time this research paper was published at ACM UIST, Apple announced their multi-touch mouse. While both are compelling, Mouse 2.0 offers several innovations that the Apple mouse lacks. Note the mixing of two different forms of indirect interaction: the mouse pointer and the ‘touch pointer’:

clip_image006

Figure 2: Mouse 2.0 adds multi-touch capabilities to the world’s second most ubiquitous computer input device.

Indirect touch is also highly relevant for very large displays. In this system, a user sits at a table in front of a 20’ high-resolution display. Instead of having to walk across the room to use the display, the user sits and performs indirect touch gestures. The left hand positions the control area on the display, and the right hand performs touch gestures within that area. This demonstrates how indirect touch is far more scalable than direct touch, and also shows a good way to use two hands together:

clip_image008

Figure 3: Two-handed indirect touch interaction with very large displays.

Also for large displays: this time, televisions. An indirect touch project from Panasonic shows how we can get rid of the mass of buttons on a remote control: display only the buttons that you need at the moment. To allow this to be done cheaply, they built a remote control that has only two touch sensors (with actuators, so that you can push down to select). The visual feedback is shown on the screen:

clip_image010

Figure 4: Panasonic’s prototype remote control has no buttons: the interface is shown on the TV screen, customized for the current context of use.

Finally, the TactaPad demonstrates how indirect multi-touch interaction can drive existing applications without many of the drawbacks of direct touch. The TactaPad uses vision and a touch pad to create a device that can detect the position of the hands while they are in the air, and send events to targeted applications when the user touches the pad. While less subtle, this starts to show how a generic indirect multi-touch peripheral might behave in an existing suite of applications:

clip_image012

Figure 5: the TactaPad provides an indirect touch experience to existing applications.

From this grab-bag of indirect touch projects, we can start to see a pattern emerge: direct touch is appropriate for some applications, but is absolutely the wrong tool for others. Indirect touch fills the many voids left by direct touch. Indirect touch interaction allows users to operate at a distance, to scale their interactions, and to work in a less tiring way. If touch is the future of interaction, indirect touch is the way it will be achieved. Apple’s trackpad features a set of indirect touch gestures that their customers use every day to interact with their MacBook. My girlfriend recently turned down a Boot Camp install of Windows 7 because it doesn’t support her trackpad gestures:

clip_image014

clip_image016

clip_image018

clip_image020

clip_image022

clip_image024

These gestures are usable (natural? No such thing) and definitely useful. As always, the key to success is a real understanding of user needs and the fundamentals of available technologies. From there, we can begin to design software and hardware solutions to those needs in a way that provides an exciting, compelling, and truly useful experience.

Wednesday, September 30, 2009

The Myth of the Natural Gesture

This entry is meant to address a myth that we encounter time and again: the Myth of the Natural Gesture.

The term “Natural User Interface” evokes mimicry of the real world. A naïve designer looks only to the physical world, and copies it, in the hopes that this will create a natural user interface. Paradoxically to some, mimicking the physical world will not yield an interface that feels natural. Consider two examples to illustrate the point.

First, consider the GUI, or graphical user interface. Our conception of the GUI is based on the WIMP toolset (windows, icons, menus, and pointers). WIMP is actually based on a principle of ‘manipulation’, or physical movement of objects according to naïve physics. The mouse pointer is a disembodied finger, poking, prodding, and dragging content around the display. In short, if ‘natural’ means mimicking nature, we did that already – and it got us to the GUI.

Second, consider the development of the first-generation Microsoft Surface gestures. The promise of the NUI is an interface that is immediately intuitive, with minimal effort to operate. In designing Surface gestures, we constantly asked: “what is natural to the user?” This question is a proxy for “what gesture would the user likely perform at this point to accomplish this task?” In both forms, this is the wrong question. Here’s why:

Imagine an experiment intended to elicit the ‘natural’ gesture set – that set of gestures users would perform without any prompting. You might design it like this: show the user screen shots of your system before and after the gesture is performed. Then, ask them to perform the gesture that they believe would cause that transition. Here’s an example:

clip_image002

Figure 1: Three different pairs of images: the screen before (left) and after (right) a gesture was performed. What gesture would you use? Almost all of you would perform the same gesture for example 1. About half of you would perform the same gesture for example 2. Example 3 is all over the map.

In two experiments, conducted by teams at Microsoft Surface and at MSR, almost no congruence was found among user-defined gestures, even for the simplest of operations (MSR work).

A solution lies in the realization that both studies were intentionally run free of context: there were no on-screen graphics to induce a particular behaviour. Imagine trying to use a slider in a GUI without a thumb painted on the screen to show you the level, induce the behaviour (drag), and give feedback during the operation so you know when to stop. As we develop NUI systems, they too will need to provide their own affordances and graphics.

So, if you find yourself trying to define a gesture set for some platform or experience, consider the universe you have built for your user: what do the on-screen graphics afford them to do? User response is incredibly contextual. Even in the real world, the ‘pick up’ gesture for the same object is different when that object has been super-heated. If there is no obvious gesture in the context you have created, it means the rules and world you have built are incomplete: expand them. But don’t try to fall back to some other world (real or imagined).

While the physical world and prior user experience provide a tempting target (a built world with understood rules that will elicit likely responses), relying on that alone when building a user interface will be disastrous, because no gesture is inherently ‘natural’. Relying on some mythical ‘natural’ gesture will only yield frustration for your users.

Monday, June 15, 2009

Why are Surface and Windows 7 Gestures Different?

My commitments include driving towards a standardized gesture language across Microsoft. Given this, someone asked me recently: why are the gestures across our products not the same, and why are you designing new gestures for Surface that aren’t the same as those in Windows?

The answer comes in two parts. The first is to point out that manipulations, a specific type of gesture, are standardized across the company, or are in the process of being standardized. The manipulation processor is a component of a number of different platforms, including both Microsoft Surface and Windows 7. The differences lie in non-manipulation gestures (what I have previously called ‘system gestures’). So: why are we developing non-manipulation gestures different from those included in Win 7?

Recall: Anatomy of a Gesture

Recall from a previous post that gestures are made up of three parts: registration, continuation, and termination. For the engineers in the audience, a gesture can be thought of as a function call: the user selects the function in the registration phase, specifies its parameters during the continuation phase, and the function is executed at the termination phase:

clip_image002

Figure 1: stages of a gesture: registration (function selection), continuation (parameter specification), termination (execution)

To draw from an example you should all now be familiar with, consider the two-finger diverge (‘pinch’) in the manipulation processor – see the table below. I put this into a table in the hopes you will copy and paste it, and use it to classify your own gestures in terms of registration, continuation, and termination actions.

Gesture Name: ManipulationProcessor/Pinch
Logical Action: Rotates, resizes, and translates (moves) an object.
Registration: Place two fingers on a piece of content.
Continuation: Move the fingers around on the surface of the device: the changes in the length, center position, and orientation of the line segment connecting these points are applied 1:1 to the scale (both height and width), center position, and orientation of the content.
Termination: Lift the fingers from the surface of the device.

Table 1: stages of multi-touch gestures.
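To make the continuation phase in Table 1 concrete, here is a minimal Python sketch of the geometry involved: it derives the scale, rotation, and translation deltas from the segment connecting two contacts before and after they move. This is only an illustration of the idea, not the actual manipulation processor API.

```python
import math

def two_finger_delta(p1_old, p2_old, p1_new, p2_new):
    """Derive scale, rotation (radians), and translation from two contacts.

    The length, orientation, and midpoint of the segment joining the two
    fingers are compared before and after the fingers move; the changes are
    what a manipulation-style processor would apply 1:1 to the content.
    """
    def length(a, b):
        return math.hypot(b[0] - a[0], b[1] - a[1])

    def angle(a, b):
        return math.atan2(b[1] - a[1], b[0] - a[0])

    def midpoint(a, b):
        return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)

    scale = length(p1_new, p2_new) / length(p1_old, p2_old)
    rotation = angle(p1_new, p2_new) - angle(p1_old, p2_old)
    m_old, m_new = midpoint(p1_old, p2_old), midpoint(p1_new, p2_new)
    translation = (m_new[0] - m_old[0], m_new[1] - m_old[1])
    return scale, rotation, translation

# Fingers spread apart horizontally: pure scale, no rotation or translation.
print(two_finger_delta((0, 0), (10, 0), (-5, 0), (15, 0)))  # (2.0, 0.0, (0.0, 0.0))
```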

Avoiding Ambiguity

The design of the manipulation processor is ‘clean’, in a particularly nice way. To understand this, consider our old friend the Marking Menu. For those not familiar, think of it as a gesture system in which the user flicks their pen/finger in a particular direction to execute a command (there is a variation of this in Win 7, minus the menu visual):

clip_image004

Figure 2: Marking menus
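As a rough illustration of the flick-to-command idea (and not the actual Win 7 gesture recognizer), here is a minimal Python sketch that bins the direction of a completed stroke into one of eight marking-menu slices; the command names are hypothetical.

```python
import math

# Hypothetical eight-slice marking menu: the command is chosen purely by the
# direction of the stroke from pen/finger-down to pen/finger-up.
# Angles use the math convention (y up); screen coordinates would flip the sign.
COMMANDS = ["right", "up-right", "up", "up-left",
            "left", "down-left", "down", "down-right"]

def classify_stroke(start, end):
    """Return the command whose 45-degree slice contains the stroke direction."""
    theta = math.degrees(math.atan2(end[1] - start[1], end[0] - start[0]))
    slice_index = int(((theta + 22.5) % 360) // 45)
    return COMMANDS[slice_index]

print(classify_stroke((0, 0), (30, 2)))    # "right"
print(classify_stroke((0, 0), (-1, 40)))   # "up"
```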

Keeping this in mind, let’s examine the anatomy of a theoretical ‘delete’ gesture:

Gesture Name: Theoretical/Delete
Logical Action: Delete a file or element.
Registration: 1. Place a finger on an item. 2. Flick to the left.
Continuation: None.
Termination: Lift the finger from the surface of the device.

Table 2: stages of the theoretical ‘delete’ gesture.

Two things are immediately apparent. The first is that there is no continuation phase for this gesture. This isn’t surprising, since the delete command has no parameters: there isn’t more than one possible way to ‘delete’ something. The second is that the registration requires two steps: first the user places their finger on an element, then they flick to the left.

Requiring two steps to register a gesture is problematic. First, it increases the probability of an error, since the user must remember multiple steps. Second, error probability also increases if the second step has too small a space relative to other gestures (e.g., if there are more than 8 options in a marking menu). Third, it requires an explicit mechanism to transition between the registration and continuation phases: if flick-right is ‘resize’, how does the user then specify the size? Either it is a separate gesture, requiring a modal interface, or the user keeps their hand on the screen and needs a mechanism to say ‘I am now done registering; I would like to start the continuation phase’. Last, the system cannot respond to the user’s gesture in a meaningful way until the registration step is complete, which delays feedback.

Let’s consider a system which implements just 4 gestures: one for manipulation of an object (grab and move it), along with 3 for system actions (‘rename’, ‘copy’, ‘delete’) using flick gestures. We can see the flow of a user’s contact in Figure 3. When the user first puts down their finger, the system doesn’t know which of these 4 gestures the user will be doing, so it’s in the state labelled ‘<ambiguous>’. Once the user starts to move their finger around the table in a particular speed and direction (‘flick left’ vs. ‘flick right’) or pattern (‘slide’ vs. ‘question mark’), the system can resolve that ambiguity, and the gesture moves into the registration phase:

image

Figure 3: States of a hand gesture, up to and including the end of the Registration phase – continuation and termination phases are not shown.
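To make Figure 3 concrete, here is a toy Python sketch of a recognizer that sits in the ambiguous state after finger-down and only commits to ‘move’ or a flick once a few frames reveal the speed. The thresholds and state names are hypothetical, not the Surface or Win 7 implementation.

```python
# Toy recognizer for the situation in Figure 3: after finger-down, the system
# is <ambiguous>; only after a few frames of motion can it commit to a flick
# command or fall back to the 'move' manipulation. Thresholds are hypothetical.

FLICK_SPEED = 2.0        # px per frame: faster than this means "flick"
DECISION_FRAMES = 3      # frames to watch before committing

class ToyRecognizer:
    def __init__(self):
        self.positions = []
        self.state = "idle"

    def finger_down(self, x, y):
        self.positions = [(x, y)]
        self.state = "ambiguous"          # cannot commit yet

    def finger_move(self, x, y):
        if self.state != "ambiguous":
            return self.state
        self.positions.append((x, y))
        if len(self.positions) <= DECISION_FRAMES:
            return self.state              # still watching; any feedback is a guess
        (x0, y0), (xn, yn) = self.positions[0], self.positions[-1]
        speed = ((xn - x0) ** 2 + (yn - y0) ** 2) ** 0.5 / (len(self.positions) - 1)
        if speed > FLICK_SPEED:
            self.state = "flick-left" if xn < x0 else "flick-right"
        else:
            self.state = "move"            # registered as the manipulation
        return self.state

r = ToyRecognizer()
r.finger_down(100, 100)
for x in (99, 98, 97, 96):                 # slow drag: eventually registers as "move"
    print(r.finger_move(x, 100))           # ambiguous, ambiguous, move, move
```

Note the window of frames in which the recognizer cannot yet say what the user is doing: that window is exactly the ambiguity discussed below.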

 

Gesture Name: Theoretical/Rename
Logical Action: Enter the system into “rename” mode (the user then types the new name with the keyboard).
Registration: 1. Place a finger on an item. 2. Flick the finger down and to the right.
Continuation: None.
Termination: Lift the finger from the surface of the device.

Gesture Name: Theoretical/Copy
Logical Action: Create a copy of a file or object, immediately adjacent to the original.
Registration: 1. Place a finger on an item. 2. Flick the finger up and to the left.
Continuation: None.
Termination: Lift the finger from the surface of the device.

Gesture Name: Theoretical/Delete
Logical Action: Delete a file or element.
Registration: 1. Place a finger on an item. 2. Flick the finger up and to the right.
Continuation: None.
Termination: Lift the finger from the surface of the device.

Gesture Name: ManipulationProcessor/Move
Logical Action: Change the visual position of an object within its container.
Registration: 1. Place a finger on an item. 2. Move the finger slowly enough to not register as a flick.
Continuation: Move the finger around the surface of the device. Changes in the position of the finger are applied 1:1 as changes to the position of the object.
Termination: Lift the finger from the surface of the device.

Table 3: stages of various theoretical gestures, plus the manipulation processor’s 1-finger move gesture.

Consequences of Ambiguity

Let’s look first at how the system classifies the gestures: if the finger moves fast enough, it is a ‘flick’, and the system goes into ‘rename’, ‘copy’, or ‘delete’ mode based on the direction. Consider now what happens for the few frames of input while the system is testing to see if the user is executing a flick. Since it doesn’t yet know that the user is not intending to simply move the object quickly, there is ambiguity with the ‘move object’ gesture. The simplest approach is for the system to assume each gesture is a ‘move’ until it knows better. Consider the interaction sequence below:

clip_image008 | clip_image010 | clip_image012

Figure 4: Interaction of a ‘rename’ gesture: 1: User places finger on the object. 2: the user has slid their finger, with the object following along. 3: the ‘rename’ gesture has registered, so the object pops back to its original location.

Because the user’s intention is unclear for the first few frames, the system designers have a choice. Figure 4 represents one option: assume that the ‘move’ gesture is being performed until another gesture is registered after a few frames of input have been analyzed. This is good, because the user gets immediate feedback. It’s bad, however, because the feedback is wrong: we are showing the ‘move’ gesture, but the user intends to perform a ‘rename’ flick. The system has to undo the ‘move’ at the moment ‘rename’ registers, and we get an ugly popping effect. This problem could be avoided: provide no response until the user’s action is clear. That would correct the bad feedback in the ‘rename’ case, but consider the consequence for the ‘move’ case:

clip_image013 | clip_image015 | clip_image017

Figure 5: Interaction of a ‘move’ gesture in a thresholded system: 1: User places finger on an object. 2: slides finger along the surface. The object does not move, because the system does not yet know whether a flick is being performed. 3: the object jumps to catch up to the user’s finger.

Obviously, this too is a problem: the system does not provide the user with any feedback at all until it is certain that they are not performing a flick.

The goal, ultimately, is to minimize the time during which the user’s intention is ambiguous. Aside from all of the reasons outlined above, ambiguity also creates a bad feedback situation. The user has put their finger down, and it has started moving – how soon does the recognizer click over to “delete” mode, vs. waiting to give the user a chance to do something else? The sooner it clicks, the more likely there will be errors. The later it clicks, the more likely the user will get ambiguous feedback. It’s just a bad situation all around. The solution is to tie the registration event to the finger-down event: as soon as the hand comes down on the display, the gesture is registered. The movement of the contacts on the display is used only for the continuation phase of the gesture (i.e., specifying the parameters).
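Here is a minimal Python sketch of that principle, with hypothetical gesture names: the gesture is selected entirely from what the sensor reports at contact-down (contact count, and contact shape on hardware that can see it), and everything that follows feeds only the continuation phase.

```python
# Sketch of registration tied to contact-down: the number of contacts (and,
# on shape-capable hardware, the contact shape) selects the gesture at the
# instant the hand lands; subsequent motion only feeds the continuation phase.
# Gesture names here are hypothetical.

def register_gesture(contact_count, is_fist=False):
    """Pick the gesture the moment contacts appear, with no ambiguous period."""
    if is_fist:
        return "delete"            # e.g. pound the table with a closed fist
    if contact_count == 1:
        return "move"              # single finger: 1-finger manipulation
    if contact_count == 2:
        return "pinch"             # two fingers: rotate/resize/translate
    if contact_count >= 5:
        return "copy"              # full-hand diverge
    return "unassigned"

def continue_gesture(gesture, contact_deltas):
    """Motion after contact-down only supplies parameters, never re-registers."""
    return gesture, contact_deltas   # a real processor would update the content here

g = register_gesture(2)
print(continue_gesture(g, [(4, 0), (-4, 0)]))   # ('pinch', [(4, 0), (-4, 0)])
```

Notice there is no ambiguous period at all: the registration decision never depends on watching motion over time.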

Hardware Dictates What’s Possible

Obviously, the Windows team are a bunch of smart cats (at least one of them did his undergrad work at U of T, so you know it’s got to be true). So, why is it that several of the gestures in Windows 7 are ambiguous at the moment of contact-down? Well, as your mom used to tell you, ‘it’s all about the hardware, stupid’. The Windows 7 OEM spec requires only that hardware manufacturers sense two simultaneous points of contact. Further, the only things we know about those points of contact are their x/y positions on the screen (see my previous article on ‘full touch vs. multi-touch’ to understand the implications). So, aside from the location of the contact, there is only 1 bit of information about the gesture at the time of contact-down: whether there are 1 or 2 contacts on the display. With a message space of only 1 bit, it’s pretty darn hard to have a large vocabulary. Of course, some hardware built for Win 7 will be more capable, but designing for all means optimizing for the least.

In contrast, the next version of Microsoft Surface is aiming to track 52 points of contact. For a single hand, that means roughly 3 bits of data about the number of contacts alone. Even better, we will have full shape information about the hand: rather than using different directions of flick to specify delete vs. copy, the user simply pounds on the table with a closed fist to delete, and performs a full-hand diverge to copy (as opposed to the two-point diverge that is ‘pinch’, described above). It is critical to our plan that we fully leverage the Microsoft Surface hardware to enable a set of gestures with fewer errors and a better overall user experience.

How will we Ensure A Common User Experience?

Everyone wants a user experience that is both optimal and consistent across devices. To achieve this, we are looking to build a system that supports both the Win 7 non-manipulation gesture language and our own shape-enabled gestures. The languages will co-exist, and we will use self-revealing gesture techniques to transition users to the more reliable Surface system. Further, we will ensure that the Surface equivalents are logically consistent: users will learn a couple of rules, and then simply apply those rules to translate Win 7 gestures into Surface gestures.

Our Team's Work

Right now, we are in the process of building our gesture language. Keep an eye on this space to see where we land.

Wednesday, April 15, 2009

What Gesture do I Use for “MouseOver”?

Recently, a PM on my team asked me this question. He was working with a group building an application for Microsoft Surface; they had designed their application and suddenly realized that they had assigned reactions to ‘MouseOver’ that could not be mapped to anything, since Surface does not have a ‘hover’ event.

Boy, was this the wrong question!

What we have here is a fundamental misunderstanding of how to design an application for touch. It’s not his fault: students trained in design, CS, or engineering build applications against the mouse model of interaction so often that its conventions become ingrained assumptions about the universe in which design is done. We have found this time and again when trying to hire onto the Surface team: a designer will present an excellent portfolio, often with walkthroughs of webpages or applications. These walkthroughs are invariably constructed as a series of screen states – and what transitions between them? Yep: a mouse click. The ‘click’ is so fundamental that it doesn’t even form part of the storyboard; it’s simply assumed as the transition.

As we start to build touch applications, we need to teach a lower-level understanding of interaction models and paradigms, so that design tenets no longer include baseline assumptions about interaction, such as the ‘click’ as the transition between states.

Mouse States and Transitions vs. Touch States & Transitions

Quite a few years ago, a professor from my group at U of T (now a principal researcher at MSR) published a paper describing the states of mouse interaction. Here’s his classic state diagram, re-imagined a little with different terminology to make it more accessible:

image

Figure 1. Buxton’s 3-state model of 1-button mouse hardware

This model describes the three states of the mouse. State 0 is when the mouse is in the air and not tracking (sometimes called ‘clutching’, often done by users when they reach the edge of the mouse pad). State 1 is when the mouse is moving around the table, and thus the pointer is moving around the screen; this is sometimes called “mouse hover”. State 2 is when a mouse button has been depressed.

As Bill often likes to point out, the transitions are just as important as, if not more important than, the states themselves. It is the transitions between states that usually evoke events in a GUI system: the transition from state 1 to 2 fires the “button down” event in most systems; 2 to 1 is “button up”. A 1 → 2 → 1 transition is a “click”, if done with no movement and in a short period of time.

Aside for the real geeks: the model shown in Figure 1 is a simplification: it does not differentiate between different mouse buttons. Also, it is technically possible to transition directly between states 0 and 2 by pushing the mouse button while the mouse is being held in the air. These and many other issues are addressed in the paper – go read it for the details, there’s some real gold.
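For concreteness, here is a toy Python version of the three-state model and the events its transitions fire; the time and movement thresholds are hypothetical, and real systems handle multiple buttons, double-clicks, and drag thresholds far more carefully.

```python
# Toy version of Buxton's 3-state mouse model: states 0 (out of range),
# 1 (tracking/hover), 2 (button down). Transitions, not states, fire events.
# Thresholds are hypothetical.

CLICK_TIME = 0.3      # seconds: a 1->2->1 round trip faster than this...
CLICK_SLOP = 4        # ...with less movement than this (px) counts as a "click"

class MouseModel:
    def __init__(self):
        self.state = 1                      # tracking on the desk
        self.down_time = self.down_pos = None

    def button_down(self, t, pos):
        self.state = 2
        self.down_time, self.down_pos = t, pos
        return ["button-down"]              # 1 -> 2 transition

    def button_up(self, t, pos):
        self.state = 1
        events = ["button-up"]              # 2 -> 1 transition
        moved = abs(pos[0] - self.down_pos[0]) + abs(pos[1] - self.down_pos[1])
        if t - self.down_time < CLICK_TIME and moved < CLICK_SLOP:
            events.append("click")          # short, nearly stationary 1->2->1
        return events

m = MouseModel()
print(m.button_down(0.00, (10, 10)))        # ['button-down']
print(m.button_up(0.15, (11, 10)))          # ['button-up', 'click']
```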

The problem the PM was confronting becomes obvious when we look at the state diagram for a touch system. Most touch systems have a similar, but different state diagram:

image

Figure 2. Buxton’s 2-state model of touch input: state names describe similar states from the mouse model, above.

As we can see, most touch systems have only two states: State 0, where the finger is not sensed, and State 2, where the finger is on the display. The lack of a State 1 is highly troubling: just think about all of the stuff that GUIs do in State 1. The mouse pointer gives a preview of where selection will happen, making selection more precise. Tooltips are shown in this state, as are highlights to show the area of effect when the mouse is clicked. I like to call these visuals “pre-visualizations”, or visual previews of what the effect of pushing the mouse button will be.

We saw in my article on the Fat Finger Problem that the need for these pre-visualizations is even more pronounced on touch systems, where precision of selection is quite difficult.

Aside for the real geeks: the model of a touch system described in Figure 2 is also a simplification: it is really describing a single-touch device. To apply this to multi-touch, think of these as states of each finger, rather than of the hardware.

There are touch systems which have a ‘hover’ state, which sense the location of fingers and/or objects before they touch the device. I’ll talk more about the differences in touch experiences across hardware devices in a future blog post.

Touch and Mouse Emulation: Doing Better than the Worst (aka: Why the Web will Always Suck on a Touch Screen)

We’ll start to see a real problem as two classes of devices increasingly browse the web at the same time: touch and mouse. Webpage designers are often liberal in their assumption that a ‘mouse over’ event is available, and there are websites that can’t be navigated without it (Netflix.com leaps to mind, but there are others). It is in this hybrid world that we’ll see the most problems.

When I taught this recently to a class of undergrads, one asked a great question: the trackpad on her laptop also has only two states, so why is this a problem for direct touch but not for the trackpad? The answer lies in understanding how events are mapped between the hardware and the software. A trackpad emulates a 3-state input device in software: the transitions between states 1 and 2 are managed entirely by the OS. This can be done with a physical button beside the trackpad (common), or with a gesture performed on the pad (‘tapping’ the pad is the most common).

In a direct-touch system, this is harder to do, but not impossible. The trick lies in being creative in how states of the various touchpoints are mapped onto mouse states in software. The naïve approach is to simply overlay the touch model atop the mouse one. This model is the most ‘direct’, because system events will continue to happen immediately beneath the finger. It is not the best, however, because it omits state 1 and is imprecise.

The DT Mouse project from Mitsubishi Electric Research Labs is the best example of a good mapping between physical contact and virtual mouse states. DT Mouse was developed over the course of several years, designed in a thoroughly user-centric way, with tweaks made in real time. It is highly tuned and includes many features. The most basic is its ability to emulate a hover state: this is done by putting two fingers down on the screen. When this happens, the pointer is put into state 1 and positioned between the fingers. The transition from state 1 to 2 is made by tapping a third finger on the screen. An advanced user does this by putting down her thumb and middle finger, and then ‘tapping’ with the index finger:

imageimage

Figure 3. Left: The pointer is displayed between the middle finger and thumb.
Right: the state 1 → 2 transition is simulated when the index finger is touched to the display
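Here is a much-simplified Python sketch of that style of mapping (hypothetical, and far cruder than the real DT Mouse): two contacts put the emulated mouse into state 1 with the pointer at their midpoint, and a third contact drives the 1 → 2 transition.

```python
# Much-simplified sketch of a DT Mouse-style mapping from touch contacts to
# emulated mouse states: fewer than 2 contacts -> state 0, 2 contacts ->
# state 1 (hover, pointer at the midpoint), 3 contacts -> state 2 (button
# down). Entirely hypothetical; the real system is far more carefully tuned.

def emulate_mouse(contacts):
    """contacts: list of (x, y) touch points. Returns (mouse_state, pointer_pos)."""
    if len(contacts) < 2:
        return 0, None                       # out of range: no pointer shown
    (x1, y1), (x2, y2) = contacts[0], contacts[1]
    pointer = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)   # hover between thumb and middle finger
    if len(contacts) >= 3:
        return 2, pointer                    # third finger tapped: button down
    return 1, pointer                        # hover / state 1

print(emulate_mouse([(0, 0), (40, 0)]))              # (1, (20.0, 0.0))
print(emulate_mouse([(0, 0), (40, 0), (20, -15)]))   # (2, (20.0, 0.0))
```

Tuned carefully, a mapping like this restores the missing State 1 without giving up touch input entirely.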

So, now we see there are sophisticated ways of doing mouse emulation with touch. But, this has to lead you to ask the question: if all I’m using touch for is mouse emulation, why not just use the mouse?

Design for Touch, Not Mouse Emulation

As we now see, my PM friend was making a common mistake: he started by designing for the mouse. To be successful, designers of multi-touch applications should start from the rules of touch, assigning state changes to events that are easily generated by a touch system.

States and transitions in a touch system include the contact state information I have shown above. In a multi-touch system, we can start to think about combining the state and location of multiple contacts, and mapping events onto those. These are most commonly referred to as gestures – we’ll talk more about them in a future blog post.

For now, remember the lessons of this post:

  1. Re-train your brain (and the brains of those around you) to work right with touch systems. It’s a 2-state system, not a 3-state one.
  2. Mouse emulation is a necessary evil of touch, but definitely not the basis for a good touch experience. Design for touch first!

Our Team’s Work

The Surface team is tackling this in two ways: first, we’re designing a mouse emulation scheme that will take full advantage of our hardware (the Win 7 effort, while yielding great results, is based on more limited hardware). Second, we are developing an all-new set of ‘interaction primitives’, which are driven by touch input. This means the end of ‘mouse down’, ‘mouse over’, etc. They will be replaced with a set of postures, gestures, and manipulations designed entirely for touch, and mapped onto events that applications can respond to.

The first example of this is our ‘manipulation processor’, which maps points of contact onto 2-dimensional spatial manipulation of objects. This processor has since been rolled out into Win 7 and Windows Mobile, and soon into both .NET and WPF. Look for many more such contributions to NUI from our team moving forward.