Ongoing Research · Colin Owens

Cross-Modal
Interaction

How senses work together — and what that means for the tools we build

The Question

What if you woke up one morning and discovered you could hear colors? What if certain numbers had personalities, or you could taste shapes? For a small fraction of the population, this is permanent reality. It's called synaesthesia — from the Greek for "together" and "sensation" — a neurological phenomenon in which stimulation of one sense automatically triggers another.

Most of us aren't synaesthetes. But we are all cross-modal. Our senses were never designed to work in isolation — they evolved together, trading information constantly, building a single coherent picture of the world from radically different kinds of input. The research question that has driven this work since 2009 is simple: if that's how perception actually works, why do we keep building tools that address one sense at a time?

Reality has no particular form. — John A. Waterworth, "Creativity and Sensation: The Case for Synaesthetic Media"

History of Cross-Modal Media

Four centuries of
color & sound

Color associations for musical notes, by theorist & year

Music & Color

Antiquity

Pythagoras & Plato — Cosmic Harmony

Pythagoras suggested that planets and stars were in cosmic harmony. Plato theorized that heavenly bodies had color and harmonic relationships according to their position in the sky. Leonardo da Vinci later sketched and wrote about the relationship of music to color, and Newton suggested a divine connection between light and sound wavelengths in his Opticks.

Early 1720s

Louis Bertrand Castel — The Color Harpsichord

Castel proposed that visible music could be produced by associating seven colors with seven keys on a harpsichord. He attached colored tape to each key and placed a candle behind the keyboard so that when each note was pressed, colors would illuminate from within. His own words: "I procured an organ, and experimented by building an attachment to the keys, which would play with different-colored lights to correspond with the music of the instrument."

1893

Bainbridge Bishop & Rimington — Color Organs

Bainbridge Bishop constructed several color organs and published a paper describing not only colors but tints and shades to represent harmony and tone. That same year, Alexander Rimington published A New Art: Colour-Music, noting his own scheme associating color according to the rainbow and describing "three novel elements: time, rhythm, and instantaneous combination." He calculated that the rate of vibration at the red end of the spectrum is approximately half that of the violet end.

1911

Scriabin, Kandinsky & Klee — Color as Sound

Alexander Scriabin associated color with tonality in Poem of Fire, juxtaposing "allied colors" arranged in spectrum with "allied tonalities" arranged in the circle of fifths. Wassily Kandinsky painted his Impression III after attending Schoenberg's concert, claiming his music-inspired works — including one named Green Sound — were meant to be "heard" by the viewer. Paul Klee, in the same year, proposed the idea of picture polyphony: "Underlying such art there must be some sort of structured order — a system of articulation and rules to be both strictly observed and departed from."

1926 onward

The Problem of Consensus

Despite centuries of experimentation, no consensus emerged on which color corresponds to which note. Color can be measured by additive, subtractive, Munsell, temperature, or saturation systems. Musical scale can be measured against Western concert pitch, the Indian 22-tone śruti scale, the Gamelan 5-tone sléndro scale, or many others. The thesis concludes: the separation of color is more important than the specific color chosen. Almost any system will do — as long as it distinguishes.

Motion in Space

Ancient–present

Dance — Indian Bharatanatyam & Ballet

Indian Bharatanatyam is a two-thousand-year-old form of traditional dance, carefully choreographed to the movements, changes, and rhythms of off-stage drummers and singers. Ballet, too, incorporates carefully choreographed music as part of theatrical performance. Yet because each production's choreography can vary from dancer to dancer and production to production, no repeatable, measurable synaesthetic association could be established — the connections were never scripted note-for-note and movement-for-movement.

Film Music & Sound

1890s–1920s

Silent Film — Music as Emotional Scaffold

Silent film was a natural extension of vaudeville, adding the repeatability of film to theatrical narrative. Early on it was accompanied by music — from the large orchestras of big cities to solo organ players in small towns — not only to convey emotion, but to drown out the projector's mechanical noise. Music ranged from purely improvisational solo performance to prearranged stock scores jury-rigged to each film, to rare continuous scores composed specifically for one film by a composer working closely with the director.

1926–1927

The Vitaphone & The Jazz Singer

Warner Bros.' Vitaphone synchronized recorded sound with film via a rotating disc — a direct descendant of Edison's cylinder — at precisely the moment when live music at film theatres was at its artistic peak. The 1927 film The Jazz Singer contained only marginal dialogue, but its popular musical content and Al Jolson's famous ad lib "Wait a minute, you ain't heard nothin' yet!" won audiences over and prompted theaters nationwide to wire for sound, putting theatre musicians out of work almost overnight. Fox's competing "sound-on-film" technology — an optically read track placed alongside the image — was the version all studios would eventually adopt.

1931

Fritz Lang's M — Sound as Narrative

Fritz Lang's M marked the first time sound — specifically music — played the most crucial narrative role in a film. The killer whistles Grieg's In the Hall of the Mountain King throughout the story, and is ultimately identified by a blind balloon seller by that very same whistle. If this were a silent film, the cue would have been lost entirely.

1936–1938

Eisenstein & Prokofiev — Montage & Peter and the Wolf

Sergei Eisenstein's "montage" editing style — cutting short sequences of images to show the passage of time — gave music a primary role in delivering narrative. His collaboration with Prokofiev on Alexander Nevsky (1938) produced the first time an audience saw a battle sequence edited to fit a pre-composed score, with the storyboard pictured alongside the musical staff. Two years earlier, Prokofiev had completed Peter and the Wolf (1936), assigning each character a dedicated instrument: Peter a string quartet, the bird a flute, the duck an oboe, the cat a clarinet. Each instrument responded to spoken narration as a form of dialogue — a direct, repeatable audiovisual correspondence.

1940–1941

Disney's Fantasia — Synthesis

Critic Paolo Milano (1941) described a spectrum for the relationship between image and music in film: at one end, neutral music subordinates to dominant image; at the other, neutral images subordinate to powerful music. In the middle lies "counterpoint" — aesthetic equality between image and music — what he called medium synthesis. Disney's Fantasia (1940) falls squarely in this middle ground, giving credence to neither sense as the dominant narrative force.

Visual Music

1930s

Fischinger — Animation as Music

Oskar Fischinger created animations responding to prerecorded musical compositions: sequences of shapes in various colors flew, stuttered, and disappeared according to the movements of the music. Groups of triangles represented string sections; major movements changed scenes and ushered in new colored shapes. Along with Mary Ellen Bute and Norman McLaren, these were the earliest animated visual representations of music — painstakingly drawn and painted by hand, then shot onto film frame by frame.

1932

Pfenninger & Fischinger — Hand-Drawn Sound

Rudolf Pfenninger discovered that drawing wave-like forms onto the optical margin of film would produce pure sine-wave tones when played back — similar to tone-wheel organs, but fully synthetic. He developed the technique because he wanted music but couldn't afford to record musicians. Fischinger expanded the method and coined the term "hand-drawn sound." László Moholy-Nagy then photographed this hand-drawn sound as the visual counter-track in his piece Tönendes ABC, so one could "see the same forms that one was also hearing" — arguably the purest form of visualizing sound, though critics found the combinations "mechanical, almost soul-less."

1949

John & James Whitney — Visual Harmonies

As students, John and James Whitney created a series of music and image animations for the 1949 Experimental Film Competition in Belgium using a single score technique to synchronize sound and image. In their studio they linked a 16mm optical printer with an instrument constructed of pendulum bells via an optical wedge — a valve that recorded the pendulum's shape as sound. At the heart of their work was the concept of visual harmonies: creating shapes and lines with light that correspond to musical harmonies using the temporality of film, searching for analogies in the natural world and in physics.

Video

1963–1969

Nam June Paik — Television into the Gallery

Nam June Paik brought television into the gallery in 1963, creating multi-screen video sculptures that ran loops on dilapidated TV monitors. His 1969 TV Bra for Living Sculpture attached two small television screens as a bra to cellist Charlotte Moorman — as she played, the images modulated to the tones of the music, one of the earliest dynamic audiovisual pieces.

1968

Terry Riley — Music With Balls

Terry Riley's Music With Balls combined video, kinetic sculpture, and music. His prerecorded saxophone and organ tones were played back on speakers embedded in two large spinning black spheres with a silver pendulum in the middle. The circular motion of music and sculpture was filmed, transferred to video tape, and edited — a truly synaesthetic work dealing with the acoustics and vision of physical space combined with the motion of a kinetic structure.

Music Video

1984–1986

Peter Gabriel & The Art of Noise

As music video matured, a few creators extended its language beyond simple band performance. Peter Gabriel's 1986 video for Sledgehammer used stop-motion animation to illustrate the text of the lyrics. The Art of Noise's 1984 video for Close (to the Edit) combined Fischinger-style animation with live action to accompany the audio — some of the earliest instances of visual music aesthetics entering the mainstream pop format.

Mid-1990s

Emergency Broadcast Network — The Video Sampler

EBN used sampled video footage from political propaganda to create lyrics and musical accompaniment, making work that was every bit as visual as it was auditory. Each musical sample had a video complement as its source. They later developed a first-of-its-kind video sampler that, when played through a keyboard, would display the companion video to the audio — a natural evolution toward the precise audiovisual timing the computer would eventually enable.

1997–2002

Michel Gondry — The Language of Visual Music

Gondry's 1997 video for Daft Punk's Around the World was an homage to Fischinger as interpretive dance: each group of dancers represented a specific instrument, descending staircases in synchrony with the melody line. His 2002 video for The Chemical Brothers' Star Guitar appeared to be a continuous shot from a passenger train, with catenary poles marking the rhythm track and retaining walls signifying the introduction or ending of a synth pad — a meticulously edited recombination based on a pre-composed visual score.

Computer Sound & Image

1970s–present

Video Games — Pong to Halo

Perhaps the best everyday examples of software emulating real-world perception are video games. Pong, one of the first commercially successful video games, used sound as direct feedback: a successful paddle hit produced one bell tone, hitting a wall produced another, and a missed ball a higher bell still. Modern first-person games like Halo 3 place noisemakers in natural 3D space relative to the player's current position — though the monitor remains a two-dimensional surface displaying three dimensions with limited peripheral viewing.

2009

The Mixing Interface — A Problem Unsolved

The layout of a digital multitrack mixer closely resembles its analog counterpart from decades earlier. The long rectangular channel module harks back to the days when recording console channels were physically removable and modular. Yet in the computer, a channel strip's position in the window bears no relationship to that sound's position in the mix: the leftmost strip can be panned hard right. There is no physical separation. The thesis asks: why, when MIDI programs evolved a natural piano-roll notation and recording windows evolved a natural time-amplitude display, did the mixing window remain a faithful — and unnecessary — copy of its analog ancestor?

2009

Toward a Natural Interface — This Thesis

The computer presents an excellent opportunity to combine sound and moving image in a repeatable, precise, and reliable way. Tools that already used the computer's native language — the MIDI piano roll, 3D spatial panners, touch-based interfaces like JazzMutant's Lemur, Golan Levin and Zach Lieberman's Manual Input Sessions, the Reactable project at Universitat Pompeu Fabra — all pointed toward the same conclusion: natural, physical metaphors grounded in how eyes and ears actually work together are more useful, and more creative, than inherited mechanical ones.

Hearing & Seeing as Complement

Our eyes evolved
for space.
Our ears, for time.

Both the eyes and the ears began as senses for orientation. The eyes developed in single-celled organisms as a way of detecting sunlight in the oceans. The ears developed from a balance-and-orientation sense, the vestibular system, still housed in our inner ear. Though they evolved separately, both give the brain cues for how we perceive the world — and each can be fooled by the other.

The Eye · The Ear

[Comparison chart: both senses rated on spatial resolution, temporal resolution, direction finding, and pattern recall, with depth perception unique to the eye and pitch discrimination unique to the ear.]

The McGurk Effect (McGurk and MacDonald, 1976) demonstrates this cross-influence directly: subjects presented with video of a person mouthing "ga-ga" while the audio track played "ba-ba" perceived neither — instead hearing "da-da." With eyes closed, subjects heard "ba-ba"; watching without sound, they saw "ga-ga." Together, the senses created a third reality.

When subjects are asked to judge the audio quality of an audiovisual stimulus, the video quality will contribute significantly to the subjectively perceived audio quality. — AES Paper, Beerends & De Caluwe, 1999

Cross-Modal Effects

Every audio effect
has a visual twin.

If there are fundamental connections between light and sound waves, then effects that alter one sense should have natural analogs in the other. The following pairs show how audio mixing concepts can be grounded in observable physics.

Pan / Localization

Position in stereo space corresponds directly to visual position along the horizontal axis — the same way we locate objects in physical space using both eyes and both ears.

Volume / Height

Height on the Y axis represents volume. "Up in the mix" is already idiomatic — a balloon rising to its peak in the visual world.

Reverb / Light Diffusion

Reverberation — many scattered reflections diffusing in space — is analogous to how a soft-box diffuses light: the same source, scattered in many directions, softening harsh edges.

Compression / Density

Compression attenuates loud passages and, with make-up gain, lifts quiet ones — squeezing dynamic range. Visually, a compressed object is denser: the same mass, visibly more compact. A squeezed beach ball contains the same air.

Phase / Light Interference

When two identical waves arrive exactly out of phase, they cancel (phase cancellation). When coherent light passes through two pinholes and projects onto a screen, the same cancellation appears as alternating bright and dark fringes — interference.

Delay / Visual Afterimage

Audio delay is a repetition of sound at diminished volume. The Pulfrich effect produces the visual equivalent: a dimmed eye perceives a pendulum as lagging behind its true position — an afterimage in time.
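The pairings above reduce to a handful of direct mappings, which can be sketched in a few lines of Python. All names here are illustrative, not drawn from the thesis: a shape's normalized position becomes pan and volume, its visual density a compression ratio, and the phase pairing can be verified numerically, since two identical waves summed exactly out of phase produce silence.

```python
import math

def shape_to_mix(x, y, density):
    """Map a shape's visual state to mixer parameters, following the
    pairings above. Names are illustrative, not from any real API.

    x, y    -- normalized position, (0, 0) at the lower-left corner
    density -- 0..1, how visually compressed the shape looks
    """
    pan = x * 2.0 - 1.0          # Pan: horizontal position, -1 (left) to +1 (right)
    volume = y                   # Volume: height on the Y axis, 0 to 1
    ratio = 1.0 + 7.0 * density  # Compression: denser shape, higher ratio (1:1 to 8:1)
    return {"pan": pan, "volume": volume, "ratio": ratio}

# Phase: two identical waves, one inverted (180 degrees out of phase), cancel.
wave = [math.sin(2 * math.pi * 440 * n / 44100) for n in range(64)]
silence = [a + b for a, b in zip(wave, (-s for s in wave))]
assert max(abs(v) for v in silence) == 0.0
```

A shape centered horizontally at four-fifths height, for example, maps to a centered pan at 0.8 volume.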

Primary Output · 2007–2012

ShapeMix

The first major output of this research was a software environment — and eventually a company — built directly from the cross-modal framework. Each sound is a shape in physical space. Each shape is a sound. Horizontal position controls stereo pan. Vertical position controls volume. Physics — gravity, spring repulsion, collision detection — govern movement. Every audio effect has a visual twin.
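The interaction model described above (shape position as pan and volume, simple physics governing motion) can be sketched as a small class. The class and its names are illustrative, a minimal reading of the model, not the shipped ShapeMix code.

```python
from dataclasses import dataclass

@dataclass
class SoundShape:
    """One sound as one shape. Position drives the mix; simple gravity
    drives the position."""
    x: float            # 0..1 across the canvas
    y: float            # 0..1 up the canvas
    vy: float = 0.0     # vertical velocity, canvas units per second
    muted: bool = False

    GRAVITY = -0.5      # pulls shapes (and therefore volume) toward the floor

    def step(self, dt):
        """Advance the physics: gravity lowers the shape until it rests
        on the floor, where its volume reaches silence."""
        self.vy += self.GRAVITY * dt
        self.y = max(0.0, self.y + self.vy * dt)
        if self.y == 0.0:
            self.vy = 0.0

    @property
    def pan(self):       # horizontal position -> stereo pan, -1 to +1
        return self.x * 2.0 - 1.0

    @property
    def volume(self):    # vertical position -> volume, unless muted
        return 0.0 if self.muted else self.y
```

Dragging a shape to (0.25, 0.8) yields a pan of -0.5 at volume 0.8; setting muted (the double-click gesture) silences it without moving it.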

The thesis made real: ShapeMix raised $2M, launched four iOS apps, and earned a patent (US20110271186A1). The SPIN Magazine co-branded remixing contest reached an audience of tens of thousands. But more than the business outcome, the product proved the core proposition — that cross-modal interaction is not a novelty, but a more natural way to work with sound.

The interactive demo below is a functional prototype of the core ShapeMix interaction. Drag shapes — left↔right is stereo pan, up↕down is volume. Double-click to mute.

Drag · left/right = pan · up/down = vol · double-click = mute

What's Next

The question
stays open.

The computer is merely a vehicle for testing the theory that there are connections between light and sound waves — and that we behave in particular ways in space.

The original research proposed that natural, physical environments can pave the way for more useful — and more creative — computing. That proposition has only become more urgent. The interfaces we build today are still largely divorced from how perception actually works: we read, we click, we type. The body is mostly absent.

Two threads are currently active. The first is a VR rebuild of ShapeMix on Meta Quest using hand tracking — removing the screen entirely and placing sound objects directly in three-dimensional space, where the cross-modal mappings become literal rather than metaphorical. The second is the question of what cross-modal interaction means when one of the parties is an AI system: what does it mean for a non-human collaborator to share a perceptual space with a human one?

These are not separate questions. They converge on the same problem that has driven this work since 2009: we build tools around how computers work. We should build them around how people perceive.

There can be no completely intimate visible and audible music until audiovisual unison is achieved. — Ralph K. Potter
colin@aboutface.io ↗