Multimodal and Affective Human Computer Interaction
By Abhinav Sharma (aus2101@columbia.edu)
Introduction
Until the 1970s, the only “humans” who “interacted” with computers were technology specialists and dedicated
hobbyists. The advent of the personal computing era, which brought with it text editors, games and graphics-based operating systems, aimed to equip the “average human” with superhuman capabilities. However, at that time, there was very little focus on the usability of hardware and software. Fortunately, a parallel series of developments was under way: communities in the broad areas of cognitive science, artificial intelligence, linguistics, cognitive psychology and anthropometry were forming to address this usability gap between human and machine. Hence, HCI developed as one of the first examples of cognitive engineering. These communities greatly helped shape the computing experiences that we have today.
As technology progressed and these usability issues were brought to light, attempts were made to make
computers more user friendly for everyday users. Command line inputs were replaced with graphical inputs, search features were added to fetch applications instead of requiring explicit navigation to their location, and common user controls were standardized over time to provide a consistent interactive experience.
More recently, human computer interaction has become an increasingly important field of interest amongst
technology and product companies. The industry has seen shifts in customer loyalty, losses in market share and growing user frustration directed at companies that have been slow to incorporate human-centered design into their products. From CLIs, traditional GUIs and touch-enabled UIs to voice user interfaces (VUIs) and holographic computing, the world has seen a rapidly changing environment that is redefining the way we do things on a daily basis. Now, more than ever, the marriage of product and software design with the user’s intuition and cognitive expectations is paramount to delivering a successful user experience. Platform companies like Microsoft have made it part of their mission to provide a consistent user experience independent of the device in use – whether a Surface tablet, a Windows PC or an augmented reality experience on HoloLens.
Two aspects of human computer interaction are of particular interest here – multimodality and affect. Windows 10 PCs integrate a personalized VUI, Cortana, which lets the user interact with the computer not only through a traditional QWERTY keyboard but also through voice commands. This represents multimodality – multiple ways to interact with a machine. However, it is not multimodal in the true sense, because these interactions are mutually exclusive and are currently not performed in an uninterrupted, synchronized way. A true multimodal interaction attempts to bridge these input channels seamlessly to achieve a natural method of communicating with the machine. In this paper, I will expound on current research in multimodal HCI and discuss the progress, possibilities and limitations in this field. The second area of interest is affective HCI. This field may sound far-fetched at first: it tries to answer questions such as – can the computer understand human emotions like anger and happiness? Can human behavior and computer interaction be influenced by music? There is limited research in this field at this time; however, I will attempt to give my views on it and present further possibilities that can be predicted in this domain.
These article reviews are prefaced by a general introductory article that introduces the different interaction styles that exist in HCI and describes the status quo of the application environment in computing today.
Shneiderman, B. (1988). We can design better user interfaces: A review of human-computer interaction styles. Ergonomics, 699-710.
This article provides the reader with an introduction to the term ‘Human Computer Interaction’ and the different
modes that exist within it. Although the article dates back to 1988, it provides a foundation in the understanding of
HCI and recommends three pillars to support the user interface design process. Along with the article, I will
provide my views on how the field has evolved, to connect the article with the developments that have taken place in the years since.
The author begins the article by heralding the next decade as the ‘golden age of ergonomics’, given the developments in HCI at the time the article was written. He differentiates physical ergonomists from cognitive ergonomists and states that the latter have made considerable progress in areas such as screen layout, graphic design, color choices and knowledge organization. He then discusses the three pillars of user interface development. The first of these is the ‘guidelines documents’, or rules for interface design. The other two pillars are UI management systems (prototyping tools, graphic tools) and usability laboratories for iterative testing. He states that metrics for testing include the time for users to complete tasks, speed and performance on benchmark tasks, the rate and distribution of errors, subjective satisfaction, and users’ retention of the syntax and semantics of the tool. It is quite surprising to see how accurate and comprehensive those metrics are, given that they capture nearly all performance metrics used in modern-day testing.
Next, the author discusses a taxonomy of interaction styles. The five classes discussed are menu selection, form fill-in, command language (CLI), natural language interaction (voice UI) and direct manipulation (mouse, touch UI). Though most of these modes still exist today, there have been some additions as well. For example, motion tracking interfaces monitor the user’s body motions and translate them into commands. As we will see later, there also exist perceptual UIs. GUIs are fairly common, with operating systems being a major example. Other common interfaces are holographic interfaces (HoloLens), gaze trackers and natural language interfaces (Google). Two or more of these unimodal interfaces can also be combined, which leads us to multimodal interfaces. These will be discussed in detail later.
The author provides a detailed description for each of the five types of interaction styles along with their
advantages and disadvantages. The discussion of each of these styles is not very fruitful as they are trivial by
today’s standards. Hence, we will focus our discussion on factors that influence the choices of these interfaces – an
issue highlighted by the author that is still prevalent today. The author states that intermittent and expert users of
UIs would prefer abbreviations and shortcuts to perform tasks when compared to novices, who need meaningful
labels and standard established procedures. However, the key question now becomes – which user do we develop
the UI for? For a novice user, a simple touch UI (optimized for quick task accomplishment) can seem intimidating, whereas it may seem trivial to expert users. This can create resistance to adopting new technologies and interfaces among certain user segments. The real challenge for companies is to help these novices cross the learning curve and embrace new technology. To make matters worse, due to the constantly evolving nature of the industry, users have to keep learning new things. Another major issue for technology companies is how to make tasks that were once inefficient on the computer much easier for the user today. Should the designers get rid of old and
inefficient ways of achieving a task? How will the users react if certain features are made unavailable in the next
product (app) release? Will they switch to a competitor’s product? If all features are retained, the complexity of
the product itself may increase, which in turn makes the product unusable. All these design decisions are
paramount at an early stage to avoid disaster later.
Turk, M. (2014). Multimodal interaction: A review. Pattern Recognition Letters, 36, 189-195.
This article is a useful summary of current research in the field of multimodal HCI. The author begins by stating that all human interaction is inherently multimodal: we employ multiple senses, both in parallel and in series, to actively explore our environment and perceive new information. In contrast, the author argues that HCI has historically focused on unimodal communication, which is an inherently less natural way to communicate. We can, for instance, only type on a keyboard to provide input for the screen. While we have technically attempted to interact with computers along a multimodal dimension – for example, using the mouse and keyboard as simultaneous inputs – we are far from achieving multimodality in its true sense. Multimodal HCI attempts to capture human communication abilities – primarily speech, gesture, touch and facial expression – using more sophisticated pattern recognition and classification methods.
The author discusses the origins of multimodal HCI and the early work done on it. Richard Bolt is cited as a key figure who conducted early experiments at the MIT Media Lab to bring multimodal HCI to life with his “Put That There” system. Essentially, the system integrated voice and gesture inputs to enable a user sitting in a chair to interact naturally with a wall display in the context of a spatial data management system. For example, one could issue commands like “Create a blue circle there” or “Move the square on the right of the circle” and the system would produce the desired output.
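To make the idea concrete, here is a minimal sketch – not Bolt’s implementation, which is not described at that level of detail above – of how a spoken command containing a deictic word like “there” might be bound to a pointing gesture captured at roughly the same moment. All names, data structures and thresholds here are hypothetical.

```python
# Hypothetical resolution of a deictic spoken command against a pointing gesture.
from dataclasses import dataclass

@dataclass
class PointingEvent:
    x: float          # normalized display coordinates in [0, 1]
    y: float
    timestamp: float  # seconds

def resolve_command(utterance, pointing_events, speech_time, max_skew=0.5):
    """Bind the word 'there' to the pointing sample closest in time to the utterance."""
    if "there" not in utterance.lower().split():
        return {"action": utterance, "target": None}
    nearest = min(pointing_events, key=lambda p: abs(p.timestamp - speech_time))
    if abs(nearest.timestamp - speech_time) > max_skew:
        return {"action": utterance, "target": None}   # no usable gesture in time
    return {"action": utterance, "target": (nearest.x, nearest.y)}

# Example: the user says "create a blue circle there" while pointing at (0.7, 0.4).
events = [PointingEvent(0.7, 0.4, 12.30), PointingEvent(0.2, 0.9, 14.10)]
print(resolve_command("create a blue circle there", events, speech_time=12.35))
```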
The author then discusses the different advantages of multimodal HCIs. The following is a subset of those, along with some thought-provoking questions. One advantage is an increase in task efficiency. Although this may be an advantage in the long term, what about the learning curve involved in the beginning? I argue that humans might initially be slower in their interaction, simply because they are not used to interacting with the machine multimodally. It will be paramount to make these multimodal interactions feel natural and unobtrusive in order to truly engage humans in this kind of interaction. The author also states that information is processed faster and better when presented in multiple modes. I wonder whether the increase in efficiency of the system is significant and, if so, whether it is significant for all combinations of senses. For example, humans can see and hear simultaneously – they do that job very well – but can they type and speak (two forms of input) with increased efficiency compared to typing or speaking alone? Those two inputs together may actually be a less natural way for humans to communicate. Other advantages include flexible and integrated use of input modes, greater precision when conveying spatial information, accommodation of a wider range of users and environmental situations, and better accessibility (for example, blind users can interact multimodally through speech and gestures).
The author then discusses a set of multimodal myths and design guidelines that would be useful in designing
systems. Some myths that require further research to debunk include the belief that if you build a multimodal system, users will interact multimodally: for example, on an iPhone both touch input and Siri are available, yet only a small subset of iPhone users actually uses both. Another myth is that efficiency, error resolution and user satisfaction always increase with multimodal systems; in practice they may not. Yet another is that combining several error-prone recognition technologies multimodally produces even greater unreliability. Some guidelines discussed are as follows –
multimodal systems should be designed for a broad range of users and contexts so that their advantage can be
realized. Privacy and security issues should be considered seriously in multimodal systems. For example, non-
speech alternatives should be available in a public context where the user is required to input confidential
information. Multimodal interfaces should be customized, via user profiles, to adapt to the needs and abilities of different users; for example, machine learning algorithms can be employed to better understand individual user context.
Finally, system outputs, switching and presentation should be consistent. These are also the biggest challenges
ahead for multimodal HCI.
Finally, the author talks about integration. He argues that some modal combinations are intended to be interpreted in parallel, while others are interpreted sequentially. He discusses a classification of multimodal interfaces in a 2x2 matrix on the basis of their fusion method (combined or independent) and their use of modalities (sequential or parallel). This gives rise to four kinds of systems – exclusive, alternative, concurrent and synergistic multimodal systems. In exclusive systems, for example, modalities are used sequentially and are not integrated by the system. The author finally discusses early versus late integration models in multimodal systems: after streams of data have arrived from different modes, should each stream be interpreted unimodally before the results are integrated (late integration), or should the raw streams be combined before interpretation (early integration)?
Pantic, M., Sebe, N., Cohn, J., & Huang, T. (2005). Affective multimodal human-computer interaction. Proceedings
of the 13th Annual ACM International Conference on Multimedia - MULTIMEDIA '05.
The authors start the discussion by raising a valid concern for AM-HCIs. They claim that HCI design was first
dominated by direct manipulation and then by delegation. However, the tacit assumption of both approaches has
been that humans will be explicit, unambiguous and fully attentive while controlling information and command
flow. This is a major impediment to having flexible machines capable of adapting to users’ levels of attention,
preferences, moods and intentions. The authors however feel that there is tremendous potential in AM-HCI given
the range of application domains it covers. For example, it can be used for automatic affective assessment of
boredom, inattention and stress in jobs with high risks such as air traffic control, nuclear power plant control and
vehicle control. It can be applied to specialized professional fields where behavioral cues are important, such as lie
detection in police agencies. It can also have a huge impact on research in the behavioral sciences and neurology.
The problem domain for AM-HCI systems can be summarized as follows – what is an affective state? How can it be accurately represented? Which human communicative signals convey information about affective states? Are facial expressions enough, or do we need to analyze body gestures as well? How is information best integrated across modalities for emotion recognition? How can this information be accurately quantified? The authors present a number of experiments that ultimately point to ongoing research in each of these thought-provoking areas. The authors converge on a definition of the capabilities of an ideal automatic human affect analyzer. The ideal system should be multimodal and should produce robust and accurate estimates of emotion despite occlusions, changes in viewing and lighting conditions, and the presence of ambient noise. Another required capability is genericity – the system should be independent of the age, sex or ethnicity of the subject. It should also be sensitive to the dynamics of the displayed affective expressions and to context, i.e., it should be able to perform temporal analysis on the sensed data while taking the current environment into account.
The status quo for facial and vocal affect analyzers is then discussed. Current facial affect analyzers handle only a small set of posed, prototypic facial expressions: the six basic emotions, from portraits or nearly frontal views of faces with no facial hair or glasses, recorded under constant illumination. Context-sensitive interpretation of facial behavior is absent. Another limitation is that facial information is not analyzed on different time scales; only short videos can be analyzed, so the subject’s mood and attitude cannot be gauged over extended periods of time. Limitations of vocal affect analyzers include estimation of only a single emotion from a limited set. For example, a human may be feeling two emotions – fear and disgust – yet the current analyzer can only output a single emotion. Other limitations mirror those of the facial affect analyzers: they do not perform context-sensitive analysis and do not extract vocal expressions over larger time scales. Current analyzers perform only in noise-free environments, on short recorded sentences containing exaggerated vocal expressions of affective states, delimited by pauses and carefully pronounced by non-smoking actors.
This paper largely presents the different challenges ahead for AM-HCIs in an accurate and comprehensive way. No
efforts beyond 2 modalities have been reported in analyzers. Further, the comprehension of a given emotional
label and the ways of expressing the related affective state may differ from culture to culture and even from
person to person – how that can be modeled effectively still remains a challenge. Another huge assumption made in current affect models is that affective states begin and end with a neutral state – for example, a human is neutral, then happy, and finally neutral again. This is seldom the case given the complexity of expressive human
behavior. Transitions from one affective state to another may include multiple apexes and may be direct, without
an intermediate neutral state. However, this has not been captured in analyzers today. To make matters worse,
there doesn’t exist an easily accessible database that could be used to benchmark efforts in this area. The lack of
test sources is a huge impediment as well. Additional concerns include – at which abstraction levels are the
modalities to be fused? How many (and which) behavioral channels should be combined for realization of robust
and accurate AM-HCIs? How can the grammar of human expressive behavior be learned? How should wrongly interpreted behavior be resolved? How should the context around the user be modeled? All these questions are ongoing areas
of research in AM-HCI.
Hardenberg, C., & Bérard, F. (2001). Bare-hand human-computer interaction. Proceedings of the 2001 Workshop
on Perceptive User Interfaces - PUI '01.
The main motivation behind the selection of this paper is to discuss instances of the advancements that have been
made in unimodal HCI. This paper is a technical effort that discusses a novel approach to use hand gestures to
control digital displays. The hope is that innovative and natural unimodal interactive efforts can be combined to
produce multimodal interactions in the future.
The authors begin the paper by stating that there exist a number of ways to facilitate human-computer interaction
using hand-held devices. However, they argue that natural interaction between humans doesn’t necessarily involve
devices because we have the ability to sense our environment with eyes and ears. In principle, a computer should
be able to imitate those abilities with cameras and microphones. The paper is intended to introduce HCI using bare
hands, which means that no devices are in contact with the body to interact with the computer. A full algorithm is
developed and implemented in a variety of situations by the authors.
A number of applications have been cited as the motivation for developing this kind of interaction. For example,
using this technique, a presenter no longer has to move between the computer and the screen to select the next slide during a presentation. Remote controls for TV sets, stereos and room lights can be replaced with this bare-
hand technology. During a video conference, the camera’s attention could be acquired by stretching out a hand,
similar to a classroom situation. Finally, mobile devices with very limited space for UI could be operated with hand
gestures. The interaction discussed here is an example of a perceptual user interface (PUI) – an interface that allows the creation of computers that are not perceived as such. The main advantages of these interfaces over traditional ones are that systems can be operated from a distance; the number of mechanical parts in a system can be reduced, making it more durable; systems can be protected from vandalism; and, in combination with speech recognition, the interaction between human and machine can be greatly simplified. In addition, vision-based PUIs have an advantage over speech recognition systems: they don’t disturb the flow of conversation (in a presentation, for example) and they work well in noisy environments. PUIs can lead to a class of applications that allow projection onto flat surfaces such as walls, with direct manipulation by hand.
There are certain basic requirements and objectives for the development of such PUIs. Functional requirements include detection of fingertips (to control the mouse pointer position), identification of certain hand postures (a stretched-out forefinger, the number of fingers stretched out and visible), 2D and 3D positions of the fingertips and the palm (to extract more complicated postures), and tracking of the hand (the ability to re-run the identification stage for each frame). Non-functional requirements include low latency between hand and pointer, resolution (the smallest pointer movement should be at most as large as the smallest selectable object on the screen) and stability (the tracked object should not constantly drift from its measured position).
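As a rough illustration of those functional requirements – and not the authors’ algorithm, which is out of scope here – the following OpenCV sketch segments a moving hand and reports candidate fingertip positions per frame. The background-subtraction approach, the thresholds and the OpenCV 4.x API usage are all my assumptions.

```python
# A crude fingertip-candidate detector: segment the moving hand, take the largest
# contour, and treat its convex-hull points as candidate fingertips for tracking.
import cv2
import numpy as np

def fingertip_candidates(frame_bgr, bg_subtractor, min_area=2000):
    mask = bg_subtractor.apply(frame_bgr)            # foreground mask of the moving hand
    mask = cv2.medianBlur(mask, 5)                   # suppress speckle noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)   # OpenCV 4.x signature
    if not contours:
        return []
    hand = max(contours, key=cv2.contourArea)        # assume the largest blob is the hand
    if cv2.contourArea(hand) < min_area:
        return []
    hull = cv2.convexHull(hand)                      # fingertip candidates lie on the hull
    return [tuple(pt[0]) for pt in hull]             # 2D positions, one per hull vertex

# Usage sketch: feed webcam frames and track the returned points frame to frame.
# cap = cv2.VideoCapture(0)
# bg = cv2.createBackgroundSubtractorMOG2()
# ok, frame = cap.read()
# print(fingertip_candidates(frame, bg))
```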
The algorithm discussion is out of the scope of this document. However, the algorithm is able to find the positions of the fingertips with low latency and high accuracy across a variety of conditions, such as different speeds of hand movement and different illumination levels. Finally, some sample applications are discussed to show real-world usage in unimodal and potentially multimodal contexts. The finger mouse is a basic application that lets the finger behave like a mouse: mouse clicks are generated by holding the finger still for one second, and the mouse wheel feature is activated by stretching out all five fingers. Another application discussed is freehand present, which allows the user to navigate between PowerPoint slides easily – for example, stretch out two fingers for the next slide, three fingers for the previous slide, and five fingers to open a slide navigation menu (to jump to a specific slide).
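The mapping from recognized gestures to commands can be sketched independently of the vision pipeline. The dispatcher below is hypothetical: only the one-second dwell for clicks and the finger counts come from the paper; everything else is assumed.

```python
# Dwell-to-click logic for the finger mouse, plus the finger-count mapping
# used by the freehand-present application described above.
import time

SLIDE_COMMANDS = {2: "next_slide", 3: "previous_slide", 5: "open_navigation_menu"}

class FingerMouse:
    def __init__(self, dwell_seconds=1.0, still_radius_px=5):
        self.dwell_seconds = dwell_seconds
        self.still_radius_px = still_radius_px
        self._anchor = None                      # (x, y, t) where the fingertip stopped

    def update(self, x, y, now=None):
        """Return 'click' once the fingertip has stayed still for the dwell time."""
        now = time.monotonic() if now is None else now
        if self._anchor is None:
            self._anchor = (x, y, now)
            return None
        ax, ay, at = self._anchor
        if abs(x - ax) > self.still_radius_px or abs(y - ay) > self.still_radius_px:
            self._anchor = (x, y, now)           # finger moved: restart the dwell timer
            return None
        if now - at >= self.dwell_seconds:
            self._anchor = (x, y, now)           # avoid firing on every following frame
            return "click"
        return None

def presentation_command(finger_count):
    return SLIDE_COMMANDS.get(finger_count)      # e.g. 2 -> "next_slide"
```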
Stiefelhagen, R., & Yang, J. (1997). Gaze tracking for multimodal human-computer interaction. 1997 IEEE
International Conference on Acoustics, Speech, and Signal Processing.
This paper is another technical attempt to explore multimodal HCI; it introduces gaze tracking and its
applications. The authors state that gaze tracking can be a part of an active or a passive system. An example of a
passive system is when the system can identify user’s message target by monitoring the user’s gaze. An example of
an active gaze tracker would be when the user uses his/her gaze to directly control an application or launch
actions. Further, a gaze tracker can be used alone or can be combined with another system like speech
recognition. The gaze tracker developed in this paper is combined with speech recognition systems and estimates
the 3D position and rotation (pose) of the user’s head. It has been used in two applications – one that helps speech recognition systems by switching the language model and grammar based on the user’s gaze information, and another
that illustrates the combination of the gaze tracker and a speech recognizer to view a panorama image.
The authors state that while multimodal interfaces offer greater flexibility and robustness, they have largely been
pen or voice based, user activated and operate in settings where headsets, suits, buttons and other constraining
devices are required. If more freedom is to be provided to users, some more important parameters of the
communicative situation have to be identified. Early gaze trackers required the user to wear specialized headgear or other expensive hardware. It is only recently that non-intrusive gaze trackers have been
developed that leverage software (using methods like weak perspective projections).
A person’s gaze direction is determined by two parameters – the head orientation and the eye orientation. In this paper, the authors develop a system that considers only head orientation. Their gaze tracker is non-intrusive and tracks six facial feature points, such as the corners of the eyes and the lip corners. The discussion of the algorithm used for identification and tracking is beyond the scope of this document. However, its applications to multimodal interfaces are discussed below.
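Although the authors’ own model-based method (and its weak-perspective variant) is not reproduced here, a modern stand-in for the same idea – recovering head pose from a handful of 2D facial feature points – can be sketched with OpenCV’s solvePnP. The 3D model coordinates and camera parameters below are illustrative guesses, not values from the paper.

```python
# Estimate head rotation and translation from six 2D facial landmarks per frame.
import cv2
import numpy as np

# Approximate 3D landmark positions (mm) in a generic head-centered coordinate frame.
MODEL_POINTS = np.array([
    (0.0,    0.0,    0.0),      # nose tip
    (-225.0, 170.0, -135.0),    # left eye outer corner
    (225.0,  170.0, -135.0),    # right eye outer corner
    (-150.0, -150.0, -125.0),   # left lip corner
    (150.0,  -150.0, -125.0),   # right lip corner
    (0.0,   -330.0,  -65.0),    # chin
], dtype=np.float64)

def head_pose(image_points_2d, frame_width, frame_height):
    """Return (rotation_vector, translation_vector) for one frame, or (None, None)."""
    focal = frame_width                                   # crude focal-length guess
    camera_matrix = np.array([[focal, 0, frame_width / 2],
                              [0, focal, frame_height / 2],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))                        # assume no lens distortion
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS,
                                  np.asarray(image_points_2d, dtype=np.float64),
                                  camera_matrix, dist_coeffs)
    return (rvec, tvec) if ok else (None, None)
```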
One application could be for activating a window on a screen or directing inquiries using gaze tracking. However,
one issue that comes up in such applications is the reliability of the gaze information. Even if the gaze tracker could
provide high accuracy gaze information, gaze information itself may not be a reliable indicator of the action to be
performed. For example, when a user sits in front of a screen, she may simply be looking at the screen without any
expectation of an action to be performed, even though her attention is on the screen. If the tracker gauges this
gaze information, it may incorrectly launch some application present on the screen. A solution is to combine the
gaze with other modalities to increase reliability. Another application for gaze trackers could be to provide
monitoring of eye gaze patterns, blink rate and pupil size; if any anomalies are noticed, alert signals could be raised. This is useful when monitoring employees in air traffic control or nuclear power plants, whose alertness and consistency of gaze are of utmost importance. Another implementation is the panoramic image viewer demonstrated in the paper. In this example, an interface has been developed that uses gaze to control the
scrolling through panoramic images and uses voice commands to control the zoom. The interface receives
parameters from the user’s head from the gaze tracker and parameters for spoken commands from the speech
recognizer.
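The panoramic viewer suggests a simple control loop in which head pose drives scrolling while recognized speech drives zoom. The sketch below shows only that dispatch logic; the tracker and recognizer interfaces, the command vocabulary and the gains are assumptions rather than details from the paper.

```python
# Gaze (head pose) scrolls the panorama; spoken commands change the zoom level.
ZOOM_COMMANDS = {"zoom in": 1.25, "zoom out": 0.8}

class PanoramaViewer:
    def __init__(self):
        self.offset_x, self.offset_y, self.zoom = 0.0, 0.0, 1.0

    def on_gaze(self, yaw_deg, pitch_deg, gain=4.0):
        """Scroll the panorama in the direction the user's head is turned."""
        self.offset_x += gain * yaw_deg
        self.offset_y += gain * pitch_deg

    def on_speech(self, command):
        """Apply a zoom factor when a recognized spoken command arrives."""
        factor = ZOOM_COMMANDS.get(command.lower().strip())
        if factor:
            self.zoom *= factor

viewer = PanoramaViewer()
viewer.on_gaze(2.5, -1.0)      # head turned slightly right and up
viewer.on_speech("zoom in")
```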
Shah, S., Teja, J., & Bhattacharya, S. (2015). Towards affective touch interaction: Predicting mobile user emotion from finger strokes. Journal of Interaction Science.
This article is based on an experiment conducted by the authors that aims to make systems more responsive to the user’s needs and expectations. The authors feel that the first step towards affective interaction is to recognize the user’s emotional state. They also feel that the design of an application matters because it can change the user’s affective state; for example, the number of steps required to perform a task could be adapted to the user’s emotional state so that the user experience improves. The key question here is – how can we recognize the emotional state of a user? This becomes even more challenging when the experiments are performed under ordinary circumstances, with no expensive equipment or setup. Hence, the paper aims to capture affect from interactions with commonly used devices.
The authors classify emotion into three categories – positive (representing happy, excited, etc.), negative (representing frustration, sadness, fear, etc.) and neutral (representing calm, content states). The authors assume that users’ touch interaction characteristics are an indirect indicator of their emotional state. Beyond the research supporting it, this assumption makes intuitive sense as well: when we are happy and excited, our touch interactions tend to be faster, more jittery and more error-prone than when we are calm. Hence, the authors’ work aims to detect emotions for users of mobile touch devices like smartphones and tablets.
The authors use three finger actions in their touch interaction model – down (the time instant when the finger touches the screen), up (the time instant when the finger is released) and move (if, after a down action, the finger moves on the screen without an up action, it is a move action). A tap is a combination of down and up actions, whereas a strike is a combination of down, move and up actions. The strike length differentiates whether the intended action was a tap or a strike: if it is below a threshold value, it is a tap; otherwise it is a strike. Several metrics, such as the deviation in the number of strikes, average strike length, average strike speed, and total, average and mode delay, are considered in the model. The data is collected from 57 participants using common tablets. The participants are split into training and test sets, and each set is further broken down by the emotional state of the participants (positive, negative, neutral). After the setup, each participant performs 7 tasks in a single session and each of the metrics mentioned above is recorded. Different classification and regression techniques are then applied to analyze the results.
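A simplified version of this pipeline – derive session-level features from stroke records, then train a classifier on labeled sessions – might look as follows. The feature set is a rough approximation of the paper’s metrics and the random-forest model is a stand-in, not the authors’ chosen technique.

```python
# Turn per-stroke measurements into session features and fit an emotion classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

TAP_THRESHOLD_PX = 10.0   # below this stroke length, an interaction counts as a tap

def session_features(strokes):
    """strokes: list of dicts with 'length' (px), 'duration' (s), 'delay_before' (s)."""
    lengths = np.array([s["length"] for s in strokes])
    speeds = np.array([s["length"] / max(s["duration"], 1e-3) for s in strokes])
    delays = np.array([s["delay_before"] for s in strokes])
    n_strikes = int((lengths >= TAP_THRESHOLD_PX).sum())
    return [n_strikes, lengths.mean(), lengths.std(),     # count and length statistics
            speeds.mean(), delays.mean(), delays.sum()]   # speed and delay statistics

def train_emotion_model(sessions, labels):
    """sessions: list of stroke lists; labels: 'positive' / 'negative' / 'neutral'."""
    X = np.array([session_features(s) for s in sessions])
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, labels)
    return model
```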
The discussion that follows concludes that there are multiple ways in which the model can be used. If we know the emotional state of the user, we can change the look and feel of the interface to complement it – for example, if the emotional state is negative, we can display bright, happy colors to appease the user. We can also change the way tasks are performed depending on the current emotional
state. This may lead to “polite” interfaces that are empathetic, which can improve user experiences.
Bramwell-Dicks, A., Petrie, H., Edwards, A., & Power, C. (n.d.). Affective Musical Interaction: Influencing Users’
Behaviour and Experiences with Music. In Music and Human-Computer Interaction, Springer Series on Cultural Computing, 67-83.
This article, adapted from the book entitled ‘Music and Human Computer Interaction’, describes some of the
research conducted in other fields that have already embraced the affective characteristic of music within their
context. It also discusses the limited amount of research conducted in this field and provides potential motivations
for working with affective musical interaction.
The chapter begins by highlighting research conducted in non-speech audio interaction. It states that these sounds
must be short in length (like a message ring on a phone) and must convey a specific meaning. Longer rings can
often annoy users if played repeatedly. The chapter then discusses the potential for music to be used for more serious tasks than leisurely listening. For example, the authors talk about how the genre, tempo and type of music played in supermarkets can influence people’s spending behavior. They highlight a study conducted by Milliman in which customers who heard slow-tempo music while shopping spent more time and money than customers who heard fast-tempo music. The authors argue that this may extend to online shopping, where users shopping on Amazon, for example, could be influenced to buy specific items and/or spend more time browsing. The authors also ask whether listening to classical music while shopping for furniture online
will lead to the purchase of more expensive furniture. The authors extend their discussion to fields like sports and
athletic performance. According to the authors, there is evidence to suggest that sporting performance may be
improved because the accompanying music acts as a distractor from the discomfort felt while performing sporting
activities like running or cycling.
The discussion moves towards the field of music psychology in which there is an ongoing debate between the
cognitivist versus the emotivist aspects of music on psychology. The cognitivist view argues that listeners can
perceive emotions from music. The emotivist view argues that music can actually change the listener’s felt
emotions. The authors lean towards the emotivist view and raise the question of whether music can be used to positively enhance users’ felt emotions, especially in boring or stressful situations.
Another study that has been analyzed by the authors involves the effect of music when performing activities such
as typing. How does music affect typing speed and accuracy? Does the tempo of the music matter? It was found that a dirge playing in the background reduced typing speed compared to no music or jazz music, while typing accuracy increased in the jazz condition compared to the other conditions.
It is clear that music does affect performance of humans in a variety of settings. However, there has been little
research and application in this area and the authors conclude the chapter by providing some direction for
research. The authors believe that some aspects must be considered when one integrates music with interactive
technology – how are users’ experiences and behaviors affected, and which features of the music most affect these behaviors? Potential dependent variables in such research include stress (can musical interfaces make stressful situations more pleasant for the user?), satisfaction (can musical interfaces make mundane tasks more enjoyable?) and the time taken to complete tasks with music versus without music. Independent variables include elements like pitch, tempo, range, key, instrumentality (versus lyric-heavy music) and syncopation.
Chen, Y., Lv, M., & Guo, L. (2015). Study on Optimal Design of Digital Music Player Based on Human-computer
Interaction. International Journal of Signal Processing, Image Processing and Pattern Recognition IJSIP, 135-146.
This paper uses principles of HCI design to optimize the design of a digital music player. It performs a thorough analysis of target users’ cognitive features for interaction, hierarchy of needs and emotional information to arrive at a recommended humanized design. The main motivation for choosing this paper was to understand the factors that influence the redesign of a system. Alongside it, I would also like to share my thoughts on how this design could be taken to the next level with multimodal interaction.
The authors start the paper by critiquing existing digital media players. They feel that there are a number of shortcomings in the status quo. For example, they argue that current interfaces contain many redundant features that are never used by the target user and only distract them. They also feel that
current systems have very poor fault tolerance. For example, if a user plays a format that is not supported by the
media player, the player may shut down or behave in some other unexpected way.
The authors state that in interaction design, the mode and sequence of options presented to users must be determined, and the focus should be on the options that influence how users carry out and complete tasks. An initial analysis of human cognitive features is done. The authors conclude that the most important index for judging the HCI design of a product is whether the operation of the product rationally and effectively conforms to users’ cognitive inertia and expectations. Next, the hierarchy of needs in interaction design is discussed; these include sensory needs such as vision and touch, personalization needs, and so on. Next, an analysis of human emotional information is performed. Research shows that emotions in HCI are manifested in the following order: sense → judgement → behavior → manifestation.
Finally, optimizations of existing digital music players are proposed. Functions that complete the auditory experience (options to adjust bass, treble, etc.), an agreeable layout, suitable font styles and color contrast, and a touch-optimized interface for mobile devices are key to an optimal design. Another optimization is expanding the player’s scope of use to special groups (users with hearing or visual impairments). Appropriate volume controls can be enforced to keep listening within a comfortable range for users. For touch interfaces, uniform touch force for user controls, uniform response speed and timely feedback are some of the performance metrics to be considered. Other metrics include the number of steps typical users take to perform common tasks, such as navigating to a song. Another identified optimization is the central placement and larger size of buttons on the interface (an ergonomic optimization) to improve the experience for users prone to imprecise actions (for example, elderly people) or users with large fingers. Yet another optimization is how well the application anticipates and handles users’ erroneous actions; for example, better search algorithms can be employed to handle an incorrect entry of a song or artist in the catalog.
A thought that I have pondered for years now is how affective computing can be used in the emotional experiences of users. I imagine that digital music players could be optimized if they could sense the user’s current affect for music recommendation. For example, a majority of users listen to music on their phones while performing some other activity, such as running or studying. There are many online streaming apps on the market, like Spotify and Pandora, which can successfully predict musical tastes for their audience (that capability is already present). However, it can be annoying for users to change individual songs on these apps as their affective state changes, which tends to happen quite naturally over time. It would be useful if the music could change automatically by sensing the change in the user’s mood. As an example, if a jogger listening to high-tempo music (affective state: excitement) takes a temporary rest after running a considerable distance, the app should be smart enough to detect the change in affective state (now tiredness) and suggest some soothing music that better fits the jogger’s current emotional context. However, these developments will be possible only with more research into affective computing and into how HCI can integrate emotions into the user’s experience.
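Purely as speculation in line with the scenario above, the selection logic of such an affect-aware player could be as simple as mapping the sensed state to a target tempo band. No real streaming-service API is used here; the states, tempo bands and library format are invented for illustration.

```python
# Pick the next track whose tempo matches the user's sensed affective state.
TEMPO_BANDS = {
    "excited": (130, 180),
    "neutral": (90, 130),
    "tired":   (60, 90),    # e.g. the resting jogger in the example above
}

def next_track(library, affective_state):
    low, high = TEMPO_BANDS.get(affective_state, (90, 130))
    candidates = [(title, bpm) for title, bpm in library if low <= bpm < high]
    return candidates[0] if candidates else None

songs = [("uptempo run mix", 165), ("acoustic wind-down", 72), ("focus piano", 100)]
print(next_track(songs, "tired"))   # -> ('acoustic wind-down', 72)
```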
Summary
In conclusion, we have discussed two important aspects of Human Computer Interaction in detail – affective computing and multimodal HCI. We have seen applications of unimodal HCI, including a bare-hand PUI for controlling digital displays and a gaze tracker for controlling displays using the subject’s gaze. We have
discussed the possibilities of combining these unimodal interfaces with other modes such as speech recognition
systems to develop multimodal systems. We have also studied affective computing by analyzing human emotions
based on touch gestures on tablets. We also discussed the possibility of making music change user behavior in
certain interaction contexts. Besides this, we discussed the current status of affective multimodal systems, their
challenges, potential research opportunities and possibilities in the future. Lastly, we also discussed the
optimization of a digital music player based on traditional HCI principles. We tried to extend our discussion by
thinking of music recommendation systems that are context sensitive to the affective state of the user.
It is clear that this area of research has tremendous potential and is constantly evolving. We have observed major interest from companies that are investing in efforts to realize multimodal systems, and we are just at the brink of this computing revolution.
Lastly, HCI as a field will evolve to integrate with ubiquitous communication powered by the cloud. Each communicating device will be a high-functionality system capable of multimodal interaction, with large, thin displays that will make our user experience exciting, media rich and more interpersonal than ever before.
Weitere ähnliche Inhalte

Was ist angesagt?

Introduction hci
Introduction hciIntroduction hci
Introduction hcisawsan slii
 
Human-Computer Interaction: An Overview
Human-Computer Interaction: An OverviewHuman-Computer Interaction: An Overview
Human-Computer Interaction: An OverviewSabin Buraga
 
HUMAN COMPUTER INTERACTION
HUMAN COMPUTER INTERACTIONHUMAN COMPUTER INTERACTION
HUMAN COMPUTER INTERACTIONshahrul aizat
 
Human computerinterface
Human computerinterfaceHuman computerinterface
Human computerinterfaceKumar Aryan
 
HCI : Activity 1
HCI : Activity 1 HCI : Activity 1
HCI : Activity 1 autamata4
 
Interaction design
Interaction designInteraction design
Interaction designDian Oktafia
 
HCI - Chapter 4
HCI - Chapter 4HCI - Chapter 4
HCI - Chapter 4Alan Dix
 
Information Architecture - introduction
Information Architecture - introduction Information Architecture - introduction
Information Architecture - introduction Asis Panda
 
Touch Research 3: How Bodies Matter [Handouts]
Touch Research 3: How Bodies Matter [Handouts]Touch Research 3: How Bodies Matter [Handouts]
Touch Research 3: How Bodies Matter [Handouts]Harald Felgner, PhD
 
Automated UI & UX Framework
Automated UI & UX FrameworkAutomated UI & UX Framework
Automated UI & UX FrameworkIJARIIT
 
Socio-cultural User Experience (SX) and Social Interaction Design (SxD)
Socio-cultural User Experience (SX) and Social Interaction Design (SxD)Socio-cultural User Experience (SX) and Social Interaction Design (SxD)
Socio-cultural User Experience (SX) and Social Interaction Design (SxD)Samir Dash
 
Media Computerization
Media ComputerizationMedia Computerization
Media ComputerizationBaljeet Singh
 
Master of Exhibit Design at La Sapienza University, Introduction and Lesson 1
Master of Exhibit Design at La Sapienza University, Introduction and Lesson 1Master of Exhibit Design at La Sapienza University, Introduction and Lesson 1
Master of Exhibit Design at La Sapienza University, Introduction and Lesson 1Salvatore Iaconesi
 
What is Human Computer Interraction
What is Human Computer InterractionWhat is Human Computer Interraction
What is Human Computer Interractionpraeeth palliyaguru
 
hcid2011 - Gesture Based Interfaces: Jacques chueke (HCID, City University L...
hcid2011 -  Gesture Based Interfaces: Jacques chueke (HCID, City University L...hcid2011 -  Gesture Based Interfaces: Jacques chueke (HCID, City University L...
hcid2011 - Gesture Based Interfaces: Jacques chueke (HCID, City University L...City University London
 

Was ist angesagt? (20)

NUI_jaydev
NUI_jaydevNUI_jaydev
NUI_jaydev
 
Introduction hci
Introduction hciIntroduction hci
Introduction hci
 
Human-Computer Interaction: An Overview
Human-Computer Interaction: An OverviewHuman-Computer Interaction: An Overview
Human-Computer Interaction: An Overview
 
Introduction To HCI
Introduction To HCIIntroduction To HCI
Introduction To HCI
 
Hci activity#1
Hci activity#1Hci activity#1
Hci activity#1
 
HUMAN COMPUTER INTERACTION
HUMAN COMPUTER INTERACTIONHUMAN COMPUTER INTERACTION
HUMAN COMPUTER INTERACTION
 
Human computerinterface
Human computerinterfaceHuman computerinterface
Human computerinterface
 
HCI : Activity 1
HCI : Activity 1 HCI : Activity 1
HCI : Activity 1
 
C0353018026
C0353018026C0353018026
C0353018026
 
Interaction design
Interaction designInteraction design
Interaction design
 
HCI - Chapter 4
HCI - Chapter 4HCI - Chapter 4
HCI - Chapter 4
 
Information Architecture - introduction
Information Architecture - introduction Information Architecture - introduction
Information Architecture - introduction
 
Touch Research 3: How Bodies Matter [Handouts]
Touch Research 3: How Bodies Matter [Handouts]Touch Research 3: How Bodies Matter [Handouts]
Touch Research 3: How Bodies Matter [Handouts]
 
Automated UI & UX Framework
Automated UI & UX FrameworkAutomated UI & UX Framework
Automated UI & UX Framework
 
Human Computer Interaction
Human Computer InteractionHuman Computer Interaction
Human Computer Interaction
 
Socio-cultural User Experience (SX) and Social Interaction Design (SxD)
Socio-cultural User Experience (SX) and Social Interaction Design (SxD)Socio-cultural User Experience (SX) and Social Interaction Design (SxD)
Socio-cultural User Experience (SX) and Social Interaction Design (SxD)
 
Media Computerization
Media ComputerizationMedia Computerization
Media Computerization
 
Master of Exhibit Design at La Sapienza University, Introduction and Lesson 1
Master of Exhibit Design at La Sapienza University, Introduction and Lesson 1Master of Exhibit Design at La Sapienza University, Introduction and Lesson 1
Master of Exhibit Design at La Sapienza University, Introduction and Lesson 1
 
What is Human Computer Interraction
What is Human Computer InterractionWhat is Human Computer Interraction
What is Human Computer Interraction
 
hcid2011 - Gesture Based Interfaces: Jacques chueke (HCID, City University L...
hcid2011 -  Gesture Based Interfaces: Jacques chueke (HCID, City University L...hcid2011 -  Gesture Based Interfaces: Jacques chueke (HCID, City University L...
hcid2011 - Gesture Based Interfaces: Jacques chueke (HCID, City University L...
 

Ähnlich wie Multimodal and Affective Human Computer Interaction - Abhinav Sharma

Multimodal man machine interaction
Multimodal man machine interactionMultimodal man machine interaction
Multimodal man machine interactionDr. Rajesh P Barnwal
 
A paper on HCI by Nalaemton and Mervin
A paper on HCI by Nalaemton and MervinA paper on HCI by Nalaemton and Mervin
A paper on HCI by Nalaemton and MervinNalaemton S
 
The Transformation Process That Interface Went Through
The Transformation Process That Interface Went ThroughThe Transformation Process That Interface Went Through
The Transformation Process That Interface Went ThroughAparna Harrison
 
Finger tracking in mobile human compuetr interaction
Finger tracking in mobile human compuetr interactionFinger tracking in mobile human compuetr interaction
Finger tracking in mobile human compuetr interactionAkhil Kumar
 
Human computer Interaction
Human computer InteractionHuman computer Interaction
Human computer Interactionshafaitahir
 
What Is Interaction Design
What Is Interaction DesignWhat Is Interaction Design
What Is Interaction DesignGraeme Smith
 
Introduction to HCI
Introduction to HCI Introduction to HCI
Introduction to HCI Deskala
 
Human Computer Interaction
Human Computer InteractionHuman Computer Interaction
Human Computer InteractionIRJET Journal
 
Importance of UX-UI in Android/iOS Development- Stackon
Importance of UX-UI in Android/iOS Development- StackonImportance of UX-UI in Android/iOS Development- Stackon
Importance of UX-UI in Android/iOS Development- Stackonnajam gs
 
Separation of Organic User Interfaces: Envisioning the Diversity of Programma...
Separation of Organic User Interfaces: Envisioning the Diversity of Programma...Separation of Organic User Interfaces: Envisioning the Diversity of Programma...
Separation of Organic User Interfaces: Envisioning the Diversity of Programma...Felix Epp
 
I2126469
I2126469I2126469
I2126469aijbm
 
Human–computer interaction (HCI), alternatively man–machine interaction (MMI)...
Human–computer interaction (HCI), alternatively man–machine interaction (MMI)...Human–computer interaction (HCI), alternatively man–machine interaction (MMI)...
Human–computer interaction (HCI), alternatively man–machine interaction (MMI)...SazzadHossain764310
 
ORGANIC USER INTERFACES: FRAMEWORK, INTERACTION MODEL AND DESIGN GUIDELINES
ORGANIC USER INTERFACES: FRAMEWORK, INTERACTION MODEL AND DESIGN GUIDELINESORGANIC USER INTERFACES: FRAMEWORK, INTERACTION MODEL AND DESIGN GUIDELINES
ORGANIC USER INTERFACES: FRAMEWORK, INTERACTION MODEL AND DESIGN GUIDELINESijasuc
 

Ähnlich wie Multimodal and Affective Human Computer Interaction - Abhinav Sharma (20)

HCI First Lecture.pptx
HCI First Lecture.pptxHCI First Lecture.pptx
HCI First Lecture.pptx
 
Multimodal man machine interaction
Multimodal man machine interactionMultimodal man machine interaction
Multimodal man machine interaction
 
A paper on HCI by Nalaemton and Mervin
A paper on HCI by Nalaemton and MervinA paper on HCI by Nalaemton and Mervin
A paper on HCI by Nalaemton and Mervin
 
The Transformation Process That Interface Went Through
The Transformation Process That Interface Went ThroughThe Transformation Process That Interface Went Through
The Transformation Process That Interface Went Through
 
2 4-10
2 4-102 4-10
2 4-10
 
Being Human
Being HumanBeing Human
Being Human
 
Finger tracking in mobile human compuetr interaction
Finger tracking in mobile human compuetr interactionFinger tracking in mobile human compuetr interaction
Finger tracking in mobile human compuetr interaction
 
CHAPTER 1 RESUME.pptx
CHAPTER 1 RESUME.pptxCHAPTER 1 RESUME.pptx
CHAPTER 1 RESUME.pptx
 
HCI.pdf
HCI.pdfHCI.pdf
HCI.pdf
 
Interactive tools
Interactive toolsInteractive tools
Interactive tools
 
Human computer Interaction
Human computer InteractionHuman computer Interaction
Human computer Interaction
 
What Is Interaction Design
What Is Interaction DesignWhat Is Interaction Design
What Is Interaction Design
 
Hci 01
Hci 01Hci 01
Hci 01
 
Introduction to HCI
Introduction to HCI Introduction to HCI
Introduction to HCI
 
Human Computer Interaction
Human Computer InteractionHuman Computer Interaction
Human Computer Interaction
 
Importance of UX-UI in Android/iOS Development- Stackon
Importance of UX-UI in Android/iOS Development- StackonImportance of UX-UI in Android/iOS Development- Stackon
Importance of UX-UI in Android/iOS Development- Stackon
 
Separation of Organic User Interfaces: Envisioning the Diversity of Programma...
Separation of Organic User Interfaces: Envisioning the Diversity of Programma...Separation of Organic User Interfaces: Envisioning the Diversity of Programma...
Separation of Organic User Interfaces: Envisioning the Diversity of Programma...
 
I2126469
I2126469I2126469
I2126469
 
Human–computer interaction (HCI), alternatively man–machine interaction (MMI)...
Human–computer interaction (HCI), alternatively man–machine interaction (MMI)...Human–computer interaction (HCI), alternatively man–machine interaction (MMI)...
Human–computer interaction (HCI), alternatively man–machine interaction (MMI)...
 
ORGANIC USER INTERFACES: FRAMEWORK, INTERACTION MODEL AND DESIGN GUIDELINES
ORGANIC USER INTERFACES: FRAMEWORK, INTERACTION MODEL AND DESIGN GUIDELINESORGANIC USER INTERFACES: FRAMEWORK, INTERACTION MODEL AND DESIGN GUIDELINES
ORGANIC USER INTERFACES: FRAMEWORK, INTERACTION MODEL AND DESIGN GUIDELINES
 

Multimodal and Affective Human Computer Interaction - Abhinav Sharma

  • 1. Multimodal and Affective Human Computer Interaction By Abhinav Sharma (aus2101@columbia.edu) Introduction Until the 1970s, the only “humans” who “interacted” with computers were technology specialists and dedicated hobbyists. The advent of the personal computing era, which brought along with it text editors, games and graphic based operating systems aimed to equip the “average human” with superhuman capabilities. However, at that time, there was very little focus on the usability aspect of hardware and software. Fortunately, there were a wonderful series of events taking place in parallel. Communities for the broad areas of study in cognitive science, artificial intelligence, linguistics, cognitive psychology and anthropometry were being formed to address this usability issue between man and machine. Hence, HCI was developed as one of the first examples of cognitive engineering. These communities greatly helped shape the computing experiences that we have today. As technology progressed and these usability issues were brought to light, attempts were made to make computers more user friendly for daily users. Command line inputs were replaced with graphical inputs, search features were added to fetch applications instead of explicitly navigating to their location, common user controls were used and standardized overtime to provide a consistent interactive experience to users. More recently, human computer interaction has become an increasingly important field of interest amongst technology and product companies. The industry has seen a shift in customer loyalty, loss in market share and strong culture of disgust in companies that have been dormant about incorporating human centered designs in their products. From CLIs, Traditional GUIs and Touch Enabled UIs to Voice User Interfaces (VUIs) and holographic computing; the world has seen a rapidly changing environment which is redefining the way we do things on a daily basis. Now, more than ever, the marriage of product-software design with the user’s intuition and cognitive expectation is paramount when it comes to delivering a successful user experience. Platform companies like Microsoft have made it as a part of their mission to provide a consistent user experience which is independent of the device they use – whether it be a Surface tablet, or a Windows PC or an augmented reality experience using Hololens. There are a couple of interesting aspects about human computer interaction – multimodality and affection. Windows 10 PCs are integrated with a personalized VUI like Cortana which allows the user to not only interact with the computer by using a traditional QWERTY keyboard, but also allows inputs using voice enabled commands. This represents multimodality – or multiple ways to interact with a machine. However, this is not the true definition of multimodal as these interactions are mutually exclusive and are currently not performed in an uninterrupted synchronized way. A true multimodal interaction will attempt to bridge these interactions in a seamless way to achieve a natural method of communication with the machine. In this paper, I will try and expound on the current research in multimodal HCI and express the progress, possibilities and limitations in this field. The second area of interest is affective HCI. This field is rather absurd. It tries to answer questions like – can the computer understand human emotions like anger, happiness etc.? Can human behavior and computer interaction be influenced by music? 
There is limited research in this field at this time; however, I will attempt to give my views on it and present further possibilities that can be predicted in this domain. These articles will be prefaced by a general introductory article that introduces the different styles that exist in HCI. This article shows the status quo on the current application environment in computing today.
  • 2. Shneiderman, B. (n.d.). We can design better user interfaces: A review of human-computer interaction styles. Ergonomics, 699-710. This article provides the reader with an introduction to the term ‘Human Computer Interaction’ and the different modes that exist within it. Although the article dates back to 1988, it provides a foundation in the understanding of HCI and recommends three pillars to support the user interface design process. Along with the article, I will provide my views on the updates in the field to bridge the article with the developments that have taken place in the past years. The author begins the article by heralding the next decade as the ‘golden age of ergonomics’, given the developments in HCI at the time the article is written. He differentiates physical ergonomists from cognitive ergonomists and states that the latter have made considerable development in areas such as screen layout, graphic design, color choices and knowledge organization. He then discusses the 3 pillars of user interface development. The first of these are ‘the guidelines documents’ or rules for interface design. The next 2 pillars are UI Management systems (prototyping tools, graphic tools) and usability laboratories for iterative testing. He states that certain metrics for testing include time for user to complete tasks, speed and performance on benchmark tasks, rate and distribution of errors, subjective satisfaction and user’s retention of syntax and semantics of the tool. It is quite surprising to see how accurate and comprehensive those metrics are, given that they capture nearly all performance metrics used in modern day testing. Next, the author discusses the taxonomy of interaction styles. The 5 classes discussed are – menu selection, form fill in, command language (CLI), Natural language interaction (voice UI), Direct Manipulation (mouse, touch UI). Though most of these modes still exist today, there have been some additions as well. For example, motion tracking interfaces monitor the user body motions and translate them into commands. As we will see later, there exist Perceptual UIs. GUIs are fairly common with Operating systems being a major example. Other common interfaces are holographic interfaces (Hololens), gaze trackers and Natural language interface (Google). There can also be a combination of one or more of these unimodal interfaces, which leads us to multimodal interfaces. These will be discussed later in detail. The author provides a detailed description for each of the five types of interaction styles along with their advantages and disadvantages. The discussion of each of these styles is not very fruitful as they are trivial by today’s standards. Hence, we will focus our discussion on factors that influence the choices of these interfaces – an issue highlighted by the author that is still prevalent today. The author states that intermittent and expert users of UIs would prefer abbreviations and shortcuts to perform tasks when compared to novices, who need meaningful labels and standard established procedures. However, the key question now becomes – which user do we develop the UI for? For a novice user, a simple touch UI (optimized for quick task accomplishment) can seem intimidating whereas it may seem trivial for expert users. This may lead to resistance in adaptability of new technology and interfaces by certain user segments. The real challenge now for companies is to help these novices to cross the learning curve to embrace new technology. 
To make matters worse, the constantly evolving nature of the industry forces users to keep learning new things. Another major issue for technology companies is how to make tasks that were once inefficient on the computer much easier for today's users. Should designers get rid of old and inefficient ways of achieving a task? How will users react if certain features become unavailable in the next product (app) release? Will they switch to a competitor's product? If all features are retained, the complexity of the product itself may increase, which in turn can make it unusable. All these design decisions must be made carefully at an early stage to avoid disaster later.
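To make the testing metrics Shneiderman lists more concrete, here is a minimal sketch of how a usability laboratory might aggregate them from session logs. The record fields, the rating scale and the log format are my own assumptions, not something prescribed by the article.

```python
# A minimal sketch (not from the article) of aggregating usability metrics
# from hypothetical session logs: time on task, error rate, satisfaction.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Session:
    start_s: float       # task start time in seconds
    end_s: float         # task end time in seconds
    errors: int          # number of user errors observed
    actions: int         # total actions performed
    satisfaction: int    # subjective rating, e.g. 1 (poor) to 7 (excellent)

def usability_report(sessions: list[Session]) -> dict:
    """Aggregate time on task, error rate and satisfaction across sessions."""
    return {
        "mean_time_on_task_s": mean(s.end_s - s.start_s for s in sessions),
        "error_rate": sum(s.errors for s in sessions) / sum(s.actions for s in sessions),
        "mean_satisfaction": mean(s.satisfaction for s in sessions),
    }

if __name__ == "__main__":
    logs = [Session(0, 42.5, errors=1, actions=20, satisfaction=6),
            Session(0, 57.0, errors=3, actions=25, satisfaction=4)]
    print(usability_report(logs))
```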
Turk, M. (n.d.). Multimodal interaction: A review. International Journal of Signal Processing, Image Processing and Pattern Recognition, 189-195.

This article is a good summary of the research that has taken place in the field of multimodal HCI. The author begins by stating that all human interaction is inherently multimodal, as we employ multiple senses – both in parallel and in series – to actively explore our environment and perceive new information. In contrast, the author argues that HCI has historically focused on unimodal communication, which is an inherently less natural way to communicate; we can, for instance, only type on a keyboard to provide input for what appears on the screen. While we have technically attempted to interact with computers along a multimodal dimension, for example by using the mouse and keyboard as simultaneous inputs, we are far from achieving multimodality in its true sense. Multimodal HCI attempts to capture human communication abilities – primarily speech, gesture, touch and facial expression – using more sophisticated pattern recognition and classification methods.

The author then covers the origins of multimodal HCI and early work in the area. Richard Bolt is cited as a key figure for his early experiments at the MIT Media Lab, which brought multimodal HCI to life with his "Put That There" system. Essentially, the system integrated voice and gesture inputs to let a user sitting in a chair interact naturally with a wall display in the context of a spatial data management system. For example, one could issue commands such as "Create a blue circle there" or "Move the square to the right of the circle" to obtain the desired output.

The author then discusses the advantages of multimodal HCI; the following is a subset of those, along with some thought-provoking questions. One advantage is an increase in task efficiency. Although this may be an advantage in the long term, what about the learning curve involved at the beginning? I argue that humans might initially be slower in their interaction, simply because they are not used to interacting with a machine multimodally. It will be paramount to make these multimodal interactions seem natural and unobtrusive in order to truly engage humans in this kind of interaction. The author also states that information is processed faster and better when presented in multiple modes. I wonder whether the increase in efficiency is significant and, if so, whether it holds for all combinations of senses. Humans can see and hear simultaneously, and do that very well, but can they type and speak (two forms of input) more efficiently than they can type or speak alone? Those two inputs together may actually be a less natural way for humans to communicate. Other advantages include flexible and integrated use of input modes, greater precision in displaying spatial information, accommodation of a wider range of users and environmental situations, and the inclusion of users with disabilities (for example, blind users can interact multimodally through speech and gestures).

The author then discusses a set of multimodal myths and design guidelines that are useful when designing such systems. One myth that requires further research to debunk is that if you build a multimodal system, users will interact multimodally.
For example, on an iPhone both touch input and Siri are enabled, yet only a small subset of iPhone users actually use both. Another myth is that efficiency, error resolution and user satisfaction will always increase with multimodal systems – in practice they may not. Yet another is that combining several error-prone recognition technologies multimodally will only produce even greater unreliability. Some of the guidelines discussed are as follows. Multimodal systems should be designed for a broad range of users and contexts so that their advantage can be realized. Privacy and security issues should be taken seriously; for example, non-speech alternatives should be available in public contexts where the user has to input confidential information. Multimodal interfaces should also be customized to the needs and abilities of individual users by means of user profiles – for example, machine learning algorithms can be employed to better understand individual user context (a minimal sketch of such adaptation follows).
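As an illustration of that customization guideline, the sketch below chooses an input modality from a hypothetical user profile and the current context. Every field name and threshold here is my own assumption rather than anything proposed in the article.

```python
# A minimal sketch (my own illustration) of context-driven modality selection:
# pick an input mode from a user profile and the current situation.
from dataclasses import dataclass

@dataclass
class UserProfile:
    prefers_speech: bool
    motor_impairment: bool

@dataclass
class Context:
    ambient_noise_db: float
    in_public: bool
    entering_confidential_data: bool

def choose_input_mode(profile: UserProfile, ctx: Context) -> str:
    """Return a suggested input modality for the current situation."""
    # Guideline: offer a non-speech alternative for confidential input in public.
    if ctx.in_public and ctx.entering_confidential_data:
        return "touch"
    # Speech recognition degrades in noisy environments; fall back to touch.
    if ctx.ambient_noise_db > 70:
        return "touch"
    if profile.prefers_speech or profile.motor_impairment:
        return "speech"
    return "touch"

print(choose_input_mode(UserProfile(prefers_speech=True, motor_impairment=False),
                        Context(ambient_noise_db=45, in_public=False,
                                entering_confidential_data=False)))  # -> "speech"
```

A real system would of course learn such rules from interaction history rather than hard-coding them.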
Finally, system outputs, switching and presentation should be consistent. These are also among the biggest challenges ahead for multimodal HCI. Lastly, the author talks about integration. He argues that some modal combinations are meant to be interpreted in parallel, while others are meant to be interpreted sequentially. He discusses a classification of multimodal interfaces in a 2x2 matrix on the basis of their fusion method (combined or independent) and their use of modalities (sequential or parallel). This gives rise to four kinds of systems – exclusive, alternative, concurrent and synergetic multimodal systems. In exclusive systems, for example, modalities are used sequentially and are not integrated by the system. The author closes with early versus late integration models: once streams of data have been captured from different modes, should the data be interpreted unimodally before being integrated (late integration), or the other way around (early integration)?

Pantic, M., Sebe, N., Cohn, J., & Huang, T. (n.d.). Affective multimodal human-computer interaction. Proceedings of the 13th Annual ACM International Conference on Multimedia - MULTIMEDIA '05.

The authors start by raising a valid concern for affective multimodal HCI (AM-HCI). They note that HCI design was first dominated by direct manipulation and then by delegation, and that the tacit assumption of both approaches has been that humans will be explicit, unambiguous and fully attentive while controlling information and command flow. This is a major impediment to building flexible machines capable of adapting to users' levels of attention, preferences, moods and intentions. The authors nevertheless see tremendous potential in AM-HCI given the range of application domains it covers. For example, it can be used for automatic assessment of boredom, inattention and stress in high-risk jobs such as air traffic control, nuclear power plant control and vehicle control. It can be applied to specialized professional fields where behavioral cues matter, such as lie detection in police agencies, and it can hugely impact the behavioral sciences, neurology and related research.

The problem domain for AM-HCI systems can be summarized as follows: what is an affective state, and how can such states be accurately represented? Which human communicative signals convey information about affective states – are facial expressions enough, or do body gestures need to be analyzed as well? How is information best integrated across modalities for emotion recognition, and how can it be accurately quantified? The authors present a number of experiments that point to ongoing research in each of these thought-provoking areas. They then converge on a definition of the capabilities of an ideal automatic human affect analyzer. The ideal system should be multimodal and should produce robust and accurate estimations of emotion despite occlusions, changes in viewing and lighting conditions, and the presence of ambient noise. Another required capability is genericity: independence from the age, sex or ethnicity of the subject. It should also be sensitive to the dynamics of the displayed affective expressions and to context, i.e. it should be able to perform temporal analysis on the sensed data while taking the current environment into account.
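The fusion question raised here, together with Turk's early-versus-late integration question above, can be made concrete with a small sketch of decision-level (late) fusion, in which each modality is interpreted on its own and only the per-modality emotion scores are combined. The emotion labels, probabilities and weights below are illustrative assumptions of mine, not values from the paper.

```python
# A minimal sketch (my own illustration) of decision-level ("late") fusion:
# each modality is classified separately and the per-modality emotion
# probabilities are then combined with fixed weights.
EMOTIONS = ["happiness", "anger", "sadness", "neutral"]

def late_fusion(face_probs: dict, voice_probs: dict,
                w_face: float = 0.6, w_voice: float = 0.4) -> str:
    """Weighted average of per-modality class probabilities; return the winner."""
    fused = {e: w_face * face_probs.get(e, 0.0) + w_voice * voice_probs.get(e, 0.0)
             for e in EMOTIONS}
    return max(fused, key=fused.get)

# Example: the face analyzer is unsure, the voice analyzer leans towards anger.
face = {"happiness": 0.35, "anger": 0.30, "sadness": 0.10, "neutral": 0.25}
voice = {"happiness": 0.10, "anger": 0.60, "sadness": 0.20, "neutral": 0.10}
print(late_fusion(face, voice))  # -> "anger"
```

In an early-integration design, the raw facial and vocal features would instead be combined before a single classifier interprets them.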
The authors then discuss the status quo of facial and vocal affect analyzers. Current facial affect analyzers handle only a small set of posed, prototypic facial expressions – the six basic emotions – detected from portraits or nearly frontal views of faces without facial hair or glasses, recorded under constant illumination. Context-sensitive interpretation of facial behavior is absent. Another limitation is that facial information is not analyzed on different time scales: only short videos can be processed, so the subject's mood and attitude cannot be gauged over extended periods. Limitations of vocal affect analyzers include the estimation of only a single, limited subset of emotions; a person may be feeling two emotions – fear and disgust, say – yet the current analyzers can output only one. Other limitations mirror those of the facial analyzers: they do not perform context-sensitive analysis and they do not extract vocal expressions over larger time scales. Moreover, current vocal analyzers work only
in noise-free environments, where the recorded sentences are short, exaggerated vocal expressions of affective states, delimited by pauses and carefully pronounced by non-smoking actors.

This paper presents the challenges ahead for AM-HCI accurately and comprehensively. No efforts involving more than two modalities have been reported for such analyzers. Further, the comprehension of a given emotional label and the ways of expressing the related affective state may differ from culture to culture and even from person to person – how that can be modeled effectively remains a challenge. Another large assumption made in current affect models is that affective states begin and end with a neutral state: a person is neutral, then happy, then neutral again. This is seldom the case given the complexity of expressive human behavior; transitions from one affective state to another may include multiple apexes and may be direct, without an intermediate neutral state, yet today's analyzers do not capture this. To make matters worse, there is no easily accessible database that could be used to benchmark efforts in this area, and the lack of test material is a huge impediment as well. Additional concerns include: at which abstraction levels should the modalities be fused? How many (and which) behavioral channels should be combined to realize robust and accurate AM-HCI? How can the grammar of human expressive behavior be learned, and how should wrongly interpreted behavior be resolved? How should the context around the user be modeled? All of these are ongoing areas of research in AM-HCI.

Hardenberg, C., & Bérard, F. (n.d.). Bare-hand human-computer interaction. Proceedings of the 2001 Workshop on Perceptive User Interfaces - PUI '01.

The main motivation behind selecting this paper is to discuss an instance of the advancements made in unimodal HCI. The paper is a technical effort that presents a novel approach to controlling digital displays with hand gestures; the hope is that innovative and natural unimodal interaction techniques can later be combined to produce multimodal interactions. The authors begin by stating that there exist a number of ways to facilitate human-computer interaction using hand-held devices. However, they argue that natural interaction between humans doesn't necessarily involve devices, because we have the ability to sense our environment with our eyes and ears; in principle, a computer should be able to imitate those abilities with cameras and microphones. The paper therefore introduces HCI using bare hands, meaning that no device is in contact with the body while interacting with the computer. A full algorithm is developed by the authors and tested in a variety of situations.

A number of applications are cited as motivation for this kind of interaction. During a presentation, for example, the presenter would not have to move between the computer and the screen to select the next slide. Remote controls for TV sets, stereos and room lights could be replaced with bare-hand control. During a video conference, the camera's attention could be acquired by stretching out a hand, much as in a classroom. Finally, mobile devices with very limited space for a UI could be operated with hand gestures. The interaction discussed here is an example of a perceptual user interface (PUI) – an interface that allows the creation of computers that are not perceived as such.
The main advantages of such interfaces over traditional ones are that systems can be operated from a distance, the number of mechanical parts in a system can be reduced (making it more durable), systems can be protected from vandalism, and, in combination with speech recognition, the interaction between human and machine can be greatly simplified. In addition, vision-based PUIs have an advantage over speech recognition systems: they do not disturb the flow of conversation (during a presentation, for instance) and they work well in noisy environments. PUIs can also lead to a class of applications that allow projection onto flat surfaces such as walls, with direct manipulation by hand.
The development of such PUIs imposes certain basic requirements and objectives. Functional requirements include the detection of fingertips (to control the mouse pointer position), the identification of certain hand postures (a stretched-out forefinger, the number of visibly stretched fingers), the 2D and 3D positions of the fingertips and palm (to extract more complicated postures), and tracking of the hand (the ability to re-run the identification stage for each frame). Non-functional requirements include the latency between hand and pointer, resolution (the smallest pointer movement should be at most as large as the smallest selectable object on the screen) and stability (the tracked object should not constantly drift from its measured position). The algorithm itself is out of the scope of this document; however, it is able to find fingertip positions with low latency and high accuracy across a variety of conditions, such as different hand speeds and illumination levels.

Finally, some sample applications are discussed to show real-world usage in unimodal and potentially multimodal contexts. The finger mouse is a basic application that lets the finger behave like a mouse: mouse clicks are generated by holding the finger still for one second, and the mouse-wheel feature is activated by stretching out all five fingers. Another application is FreeHandPresent, which allows the user to navigate between PowerPoint slides easily – for example, two stretched fingers for the next slide, three fingers for the previous slide, and five fingers to open a slide-navigation menu (to jump to a specific slide).
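The finger mouse's "hold still for one second" click is an instance of dwell-based activation, a mechanism that also turns up in the gaze-tracking work discussed next. A minimal sketch of a dwell detector follows; the sample format, radius and dwell time are my own assumptions.

```python
# A minimal sketch (my own illustration) of dwell-based activation:
# trigger a "click" when the tracked point stays within a small radius
# for long enough. Thresholds and the sample format are assumptions.
import math

def detect_dwell(samples, radius_px: float = 10.0, dwell_s: float = 1.0):
    """samples: list of (t_seconds, x, y) pointer positions in time order.
    Returns the (x, y) where the pointer stayed within `radius_px` for at
    least `dwell_s`, or None if no dwell occurred."""
    start = 0
    for i in range(len(samples)):
        t0, x0, y0 = samples[start]
        t, x, y = samples[i]
        if math.hypot(x - x0, y - y0) > radius_px:
            start = i                       # pointer moved away: restart the dwell window
        elif t - t0 >= dwell_s:
            return (x0, y0)                 # held still long enough: trigger a click
    return None

track = [(0.0, 100, 100), (0.3, 102, 101), (0.7, 101, 99), (1.1, 100, 100)]
print(detect_dwell(track))  # -> (100, 100)
```

The same detector could run on gaze coordinates instead of fingertip coordinates, although, as discussed below, dwell alone can be an unreliable trigger.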
Stiefelhagen, R., & Yang, J. (n.d.). Gaze tracking for multimodal human-computer interaction. 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

This paper is another technical attempt to explore multimodal HCI; it introduces gaze tracking and its applications. The authors state that gaze tracking can be part of an active or a passive system. In a passive system, for example, the system identifies the target of the user's message by monitoring the user's gaze; in an active system, the user directly controls an application or launches actions with his or her gaze. Further, a gaze tracker can be used alone or combined with another system such as speech recognition. The gaze tracker developed in this paper is combined with speech recognition and estimates the 3D position and rotation (pose) of the user's head. It is used in two applications: one that helps speech recognition by switching the language model and grammar based on the user's gaze, and another that combines the gaze tracker with a speech recognizer to view a panoramic image. The authors note that while multimodal interfaces offer greater flexibility and robustness, they have largely been pen- or voice-based, user-activated, and operated in settings where headsets, suits, buttons and other constraining devices are required. If more freedom is to be given to users, more parameters of the communicative situation have to be identified. Early gaze trackers required the user to wear specialized headgear or other expensive hardware; only recently have non-intrusive gaze trackers been developed that rely on software (using methods such as weak perspective projection).

A person's gaze direction is determined by two parameters – head orientation and eye orientation – and in this paper the authors develop a system that considers only head orientation. Their gaze tracker is non-intrusive and tracks six facial feature points (such as the eyes and lip corners). The identification and tracking algorithm is beyond the scope of this document; its applications to multimodal interfaces, however, are discussed below. One application could be activating a window on a screen or directing inquiries using gaze. An issue that arises in such applications is the reliability of the gaze information: even if the gaze tracker provides highly accurate gaze estimates, gaze by itself may not be a reliable indicator of the action to be performed. For example, when a user sits in front of a screen, she may simply be looking at the screen without any
expectation that an action will be performed, even though her attention is on the screen. If the tracker acted on this gaze information alone, it might incorrectly launch some application present on the screen. A solution is to combine gaze with other modalities to increase reliability. Another application of gaze trackers is the monitoring of eye-gaze patterns, blink rate and pupil size; if anomalies are noticed, alert signals can be raised. This is useful for monitoring employees in air traffic control or nuclear power plants, whose alertness and consistency of gaze are of utmost importance. Another implementation, demonstrated in the paper, is a panoramic image viewer: gaze controls scrolling through the panoramic image while voice commands control the zoom. The interface receives head-pose parameters from the gaze tracker and spoken-command parameters from the speech recognizer.

Shah, S., Teja, J., & Bhattacharya, S. (2015). Towards affective touch interaction: Predicting mobile user emotion from finger strokes. Journal of Interaction Science.

This article is based on an experiment conducted by the authors that aims to make systems more responsive to the user's needs and expectations. The authors argue that the first step towards affective interaction is to recognize the user's emotional state, and that the design of an application matters because it can change that state; for example, if the user's state is known, the number of steps required to perform a task can be adapted so that the user experience improves. The key question is: how can we recognize the emotional state of a user? This becomes even more challenging when the experiments are performed under ordinary circumstances, with no expensive equipment or set-up; hence, the study aims to capture affect from interactions with commonly used devices. The authors classify emotion into three categories: positive (happy, excited, etc.), negative (frustration, sadness, fear, etc.) and neutral (calm, content emotions). They assume that the touch-interaction characteristics of users are an indirect indicator of the user's emotional state. Besides the research supporting it, this assumption also makes intuitive sense: when we are happy and excited, our touch interactions tend to be faster, more jittery and more error-prone than when we are calm. The authors' work therefore aims to detect emotion for users of mobile touch devices such as smartphones and tablets.

The authors use three finger actions in their touch interaction – down (the time instant when the finger touches the screen), up (the timestamp when the finger is released) and move (when, after a down action, the finger moves on the screen without an up action). A tap is a combination of down and up actions, whereas a stroke is a combination of down, move and up actions; the stroke length differentiates the two – below a threshold value the gesture is a tap, otherwise it is a stroke. Several metrics, such as the deviation in the number of strokes, average stroke length, average stroke speed, and the total, average and modal delay, are considered in the model.
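To make the feature definitions concrete, here is a minimal sketch of how such stroke features might be derived from raw down/move/up events. The event format, the tap threshold and the particular aggregates are my own assumptions, not the authors' implementation.

```python
# A minimal sketch (my own illustration) of deriving stroke features from
# raw touch events of the form (action, t_seconds, x, y).
import math
from statistics import mean

TAP_THRESHOLD_PX = 15.0   # below this path length, treat the gesture as a tap

def gesture_features(events):
    """events: one down..up gesture as a list of (action, t_seconds, x, y)
    tuples with action in {"down", "move", "up"}."""
    t_down, t_up = events[0][1], events[-1][1]
    length = sum(math.hypot(x2 - x1, y2 - y1)
                 for (_, _, x1, y1), (_, _, x2, y2) in zip(events, events[1:]))
    duration = max(t_up - t_down, 1e-6)          # avoid division by zero
    kind = "tap" if length < TAP_THRESHOLD_PX else "stroke"
    return {"kind": kind, "length_px": length, "speed_px_s": length / duration}

def session_features(gestures):
    """Aggregate per-gesture features into session-level metrics."""
    feats = [gesture_features(g) for g in gestures]
    strokes = [f for f in feats if f["kind"] == "stroke"]
    return {
        "num_strokes": len(strokes),
        "avg_stroke_length": mean(f["length_px"] for f in strokes) if strokes else 0.0,
        "avg_stroke_speed": mean(f["speed_px_s"] for f in strokes) if strokes else 0.0,
    }

swipe = [("down", 0.00, 10, 300), ("move", 0.05, 60, 300), ("up", 0.10, 120, 300)]
print(session_features([swipe]))   # one fast 110 px swipe -> counted as a stroke
```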
The data is collected from 57 participants using common tablets. The participants are divided into training and test sets, and each set is further broken down by the participants' emotional state (positive, negative, neutral). Each participant then performs seven tasks in a single session, and each of the metrics mentioned above is recorded. Different classification and regression techniques are applied to analyze the results. The discussion that follows concludes that the model can be used in multiple ways. If we know the emotional state of the user, we can change the look and feel of the interface to complement it; if the emotional state is negative, for instance, the interface could display bright, cheerful colors to
appease the user. We can also change the way tasks are performed depending on the current emotional state. This may lead to 'polite', empathetic interfaces that improve the user experience.

Bramwell-Dicks, A., Petrie, H., Edwards, A., & Power, C. (n.d.). Affective Musical Interaction: Influencing Users' Behaviour and Experiences with Music. Music and Human-Computer Interaction, Springer Series on Cultural Computing, 67-83.

This article, adapted from the book 'Music and Human-Computer Interaction', describes research conducted in other fields that have already embraced the affective character of music within their own contexts. It also discusses the limited amount of research conducted on affective musical interaction itself and offers potential motivations for working in this area. The chapter begins by highlighting research on non-speech audio interaction: such sounds must be short (like a message tone on a phone) and must convey a specific meaning, since longer sounds played repeatedly can annoy users. It then moves into a discussion of the potential for music to be used for tasks more serious than leisurely listening. For example, the authors describe how the genre, tempo and type of music played in supermarkets can influence people's spending behavior. They highlight a study by Milliman in which customers who heard slow-tempo music while shopping spent more time and money than customers who heard fast-tempo music. The authors argue that this may extend to online shopping, where users browsing Amazon, for example, might be influenced to buy specific items or to spend more time browsing; they also ask whether listening to classical music while shopping for furniture online would lead to the purchase of more expensive furniture.

The authors extend the discussion to sports and athletic performance: there is evidence to suggest that sporting performance may improve because the accompanying music distracts from the discomfort felt during activities such as running or cycling. The discussion then turns to music psychology, where there is an ongoing debate between the cognitivist and emotivist views. The cognitivist view argues that listeners can perceive emotions in music, whereas the emotivist view argues that music can actually change the listener's felt emotions. The authors lean towards the emotivist view and ask whether music could be used to positively enhance users' felt emotions, especially in boring or stressful situations. Another study the authors analyze concerns the effect of music on activities such as typing: does music affect the speed and accuracy of typing, and does its tempo matter? It was found that a dirge playing in the background reduced typing speed compared with no music or jazz, while typing accuracy increased in the jazz condition compared with the others. It is clear that music affects human performance in a variety of settings; however, there has been little research and application in this area, and the authors conclude the chapter by offering some directions for research. They believe certain aspects must be considered when integrating music with interactive technology: how are users' experiences and behaviors affected?
Which features of the music most affect these behaviors? Potential dependent variables include stress (can musical interfaces make stressful situations more pleasant for the user?), satisfaction (can musical interfaces make mundane tasks more enjoyable?) and the time taken to complete tasks with music versus without. Independent variables include elements such as pitch, tempo, range, key, instrumentation (instrumental versus lyric-heavy music) and syncopation.
Chen, Y., Lv, M., & Guo, L. (2015). Study on Optimal Design of Digital Music Player Based on Human-computer Interaction. International Journal of Signal Processing, Image Processing and Pattern Recognition, 135-146.

This paper uses principles of HCI design to optimize the design of a digital music player. It analyzes the cognitive features of interaction, the hierarchy of needs and the emotional information of target users in order to arrive at a recommended humanized design. My main motivation for choosing this paper was to understand the factors that influence the redesign of a system; alongside, I would also like to share my thoughts on how this design could be taken further with multimodal interaction.

The authors start by critiquing existing digital media players, which they feel have a number of shortcomings. For example, current interfaces carry many redundant features that the target user never uses and that only distract; current systems also have very poor fault tolerance – if a user plays a format that the media player does not support, the player may shut down or behave in some other unexpected way. The authors state that in interaction design, the mode and sequence of the options presented to users must be determined, and attention should focus on the options that influence how users carry out and complete tasks. An initial analysis of human cognitive features concludes that the most important index for judging the HCI design of a product is whether its operations rationally and effectively conform to users' cognitive inertia and expectations. Next, the hierarchy of needs in interaction design is discussed, including sensory needs such as vision and touch, personalization needs and so on. An analysis of human emotional information follows; research shows that in HCI, emotions are manifested in the order sense → judgement → behavior → manifestation.

Finally, optimizations are proposed for existing digital music players. Functions such as complete auditory control (options to adjust bass, treble, etc.), an agreeable layout, suitable font styles and color contrast, and a touch-optimized interface for mobile devices are key to an optimal design. Another optimization is expanding the player's scope of use for special groups (users with hearing or visual impairments), and appropriate volume limits can be enforced to keep listening within a comfortable range. For touch interfaces, uniform touch force for user controls, uniform response speed and timely feedback are performance metrics to consider, as is the number of steps a typical user needs to perform common tasks, such as navigating to a song. Another identified optimization is the central placement and larger size of buttons (an ergonomic optimization) to improve the experience of users with inaccurate motor actions (for example, elderly people) or large fingers. Yet another is how well the application anticipates and handles erroneous user actions; for example, better search algorithms can be employed to cope with the wrong entry of a song or artist name in the catalog.
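As a small illustration of that last point, the sketch below tolerates a misspelled song title by using fuzzy string matching from Python's standard library; the catalog contents and the cutoff value are made-up assumptions.

```python
# A minimal sketch (my own illustration) of tolerating misspelled search
# entries in a music catalog, using fuzzy matching from the standard library.
import difflib

CATALOG = ["Bohemian Rhapsody", "Hotel California", "Billie Jean", "Hey Jude"]

def search_song(query: str, n: int = 3, cutoff: float = 0.5) -> list[str]:
    """Return the closest catalog entries to a possibly misspelled query."""
    # title() is a crude normalization; a real player would index case-insensitively.
    return difflib.get_close_matches(query.title(), CATALOG, n=n, cutoff=cutoff)

print(search_song("bohemien rapsody"))  # -> ['Bohemian Rhapsody']
```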
A thought I have pondered for some time is how affective computing could be used to shape users' emotional experiences. I imagine that digital music players could work better if they could sense the user's current affect and use it for music recommendation. Most users listen to music on their phones while performing some other activity, such as running or studying. Many streaming apps on the market, such as Spotify and Pandora, can already predict musical tastes for their audience, so part of the capability is already present. However, it can be annoying for users to keep changing individual songs on these apps as their affective state changes, which happens quite naturally over time. It would be useful if the music could change automatically by sensing the change in the user's mood. For example, if a jogger listening to high-tempo music (the affective state being excitement) wants to take a temporary rest after running a considerable distance, the app should be smart enough to detect the change in affective state (now tiredness) and suggest some soothing
music that better fits the jogger's current emotional context. Such developments, however, will be possible only with more research into affective computing and into how HCI can integrate emotion into the user's experience.

Summary

In conclusion, we have discussed two important aspects of Human Computer Interaction in detail – affective computing and multimodal HCI. We have seen applications of unimodal HCI, including bare-hand interaction that uses a perceptual UI to control digital displays and a gaze tracker that controls displays using the subject's gaze, and we have discussed the possibility of combining these unimodal interfaces with other modes, such as speech recognition, to build multimodal systems. We have also looked at affective computing through the analysis of human emotions from touch gestures on tablets, and at the possibility of using music to change user behavior in certain interaction contexts. Besides this, we discussed the current status of affective multimodal systems, their challenges, and the research opportunities and possibilities ahead. Lastly, we discussed the optimization of a digital music player based on traditional HCI principles and extended that discussion to music recommendation systems that are sensitive to the affective state of the user.

It is clear that this area of research has tremendous potential and is constantly evolving. We have observed major interest from companies integrating their efforts to realize multimodal systems, and we are just at the brink of this computing revolution. Looking ahead, HCI as a field will integrate into ubiquitous, cloud-powered communication: each communicating device will be a high-functionality system capable of multimodal interaction, presented on large, thin displays that will make our experiences exciting, media-rich and more interpersonal than ever before.