Ongoing Research · Colin Owens
How senses work together — and what that means for the tools we build
The Question
What if you woke up one morning and discovered you could hear colors? What if certain numbers had personalities, or you could taste shapes? For a small fraction of the population, this is everyday reality. It's called synaesthesia — from the Greek roots for "together" and "sensation" — a neurological phenomenon in which stimulation of one sense automatically triggers another.
Most of us aren't synaesthetes. But we are all cross-modal. Our senses were never designed to work in isolation — they evolved together, trading information constantly, building a single coherent picture of the world from radically different kinds of input. The research question that has driven this work since 2009 is simple: if that's how perception actually works, why do we keep building tools that address one sense at a time?
Reality has no particular form. — John A. Waterworth, "Creativity and Sensation: The Case for Synaesthetic Media"
History of Cross-Modal Media
Color associations for musical notes, by theorist & year
Antiquity
Pythagoras & Plato — Cosmic Harmony
Pythagoras suggested that planets and stars were in cosmic harmony. Plato theorized that heavenly bodies had color and harmonic relationships according to their position in the sky. Leonardo da Vinci later sketched and wrote about the relationship of music to color, and Newton suggested a divine connection between light and sound wavelengths in his Opticks.
Early 1720s
Louis Bertrand Castel — The Color Harpsichord
Castel proposed that visible music could be produced by associating seven colors with seven keys on a harpsichord. He attached colored tape to each key and placed a candle behind the keyboard so that when each note was pressed, colors would illuminate from within. His own words: "I procured an organ, and experimented by building an attachment to the keys, which would play with different-colored lights to correspond with the music of the instrument."
1893
Bainbridge Bishop & Rimington — Color Organs
Bainbridge Bishop constructed several color organs and published a paper describing not only colors but tints and shades to represent harmony and tone. That same year, Alexander Rimington published A New Art: Colour-Music, noting his own scheme associating color according to the rainbow and describing "three novel elements: time, rhythm, and instantaneous combination." He calculated that the rate of vibration at the red end of the spectrum is approximately half that of the violet end.
1911
Scriabin, Kandinsky & Klee — Color as Sound
Alexander Scriabin associated color with tonality in Poem of Fire, juxtaposing "allied colors" arranged in a spectrum with "allied tonalities" arranged in the circle of fifths. Wassily Kandinsky painted his Impression III after attending a Schoenberg concert, claiming his music-inspired works — including one named Green Sound — were meant to be "heard" by the viewer. Paul Klee, in the same year, proposed the idea of picture polyphony: "Underlying such art there must be some sort of structured order — a system of articulation and rules to be both strictly observed and departed from."
1926 onward
The Problem of Consensus
Despite centuries of experimentation, no consensus emerged on which color corresponds to which note. Color can be measured by additive, subtractive, Munsell, temperature, or saturation systems. Musical pitch can be measured against Western concert pitch, the Indian 22-tone śruti scale, the Gamelan five-tone sléndro scale, or many others. The thesis concludes: the separation of colors matters more than the specific colors chosen. Almost any system will do — as long as it distinguishes.
Ancient–present
Dance — Indian Bharatanatyam & Ballet
Indian Bharatanatyam is a two-thousand-year-old form of traditional dance, carefully choreographed to the movements, changes, and rhythms of off-stage drummers and singers. Ballet, too, incorporates carefully choreographed music as part of theatrical performance. Yet because each production's choreography can vary from dancer to dancer and production to production, no repeatable, measurable synaesthetic association could be established — the connections were never scripted note-for-note and movement-for-movement.
1890s–1920s
Silent Film — Music as Emotional Scaffold
Silent film was a natural extension of vaudeville, adding the repeatability of film to theatrical narrative. Early on it was accompanied by music — from the large orchestras of big cities to solo organ players in small towns — not only to convey emotion, but to drown out the projector's mechanical noise. Music ranged from purely improvisational solo performance to prearranged stock scores jury-rigged to each film, to rare continuous scores composed specifically for one film by a composer working closely with the director.
1926–1927
The Vitaphone & The Jazz Singer
Warner Bros.' Vitaphone synchronized recorded sound with film via a rotating disc — a direct descendant of Edison's cylinder — at precisely the moment when live music at film theaters was at its artistic peak. The 1927 film The Jazz Singer contained only marginal dialogue, but its popular musical content and Al Jolson's famous ad lib "Wait a minute, you ain't heard nothin' yet!" won audiences over and prompted theaters nationwide to wire for sound, putting theater musicians out of work almost overnight. Fox's competing "sound-on-film" technology — an optically read track placed alongside the image — was the version all studios would eventually adopt.
1931
Fritz Lang's M — Sound as Narrative
Fritz Lang's M marked the first time sound — specifically music — played the most crucial narrative role in a film. The killer whistles Grieg's In the Hall of the Mountain King throughout the story, and is ultimately identified by a blind balloon seller by that very same whistle. If this were a silent film, the cue would have been lost entirely.
1936–1938
Eisenstein & Prokofiev — Montage & Peter and the Wolf
Sergei Eisenstein's "montage" editing style — cutting short sequences of images to show the passage of time — gave music a primary role in delivering narrative. His collaboration with Prokofiev on Alexander Nevsky (1938) marked the first time an audience saw a battle sequence edited to fit a pre-composed score, with the storyboard pictured alongside the musical staff. Two years earlier, Prokofiev had completed Peter and the Wolf (1936), assigning each character a dedicated instrument: Peter a string quartet, the bird a flute, the duck an oboe, the cat a clarinet. Each instrument responded to spoken narration as a form of dialogue — a direct, repeatable audiovisual correspondence.
1940–1941
Disney's Fantasia — Synthesis
Critic Paolo Milano (1941) described a spectrum for the relationship between image and music in film: at one end, neutral music subordinates to a dominant image; at the other, neutral images subordinate to powerful music. In the middle lies "counterpoint" — aesthetic equality between image and music — what he called medium synthesis. Disney's Fantasia (1940) falls squarely in this middle ground, giving precedence to neither sense as the dominant narrative force.
1930s
Fischinger — Animation as Music
Oskar Fischinger created animations responding to prerecorded musical compositions: sequences of shapes in various colors flew, stuttered, and disappeared according to the movements of the music. Groups of triangles represented string sections; major movements changed scenes and ushered in new colored shapes. Together with the films of Mary Ellen Bute and Norman McLaren, these were the earliest animated visual representations of music — painstakingly drawn and painted by hand, then shot onto film frame by frame.
1932
Pfenninger & Fischinger — Hand-Drawn Sound
Rudolf Pfenninger discovered that drawing wave-like forms onto the optical margin of film would produce pure sine-wave tones when played back — much as a tone-wheel organ generates tones from shaped wheels, but by purely optical means. He developed the technique because he wanted music but couldn't afford to record musicians. Fischinger expanded the method and coined the term "hand-drawn sound." László Moholy-Nagy then photographed this hand-drawn sound as the visual counter-track in his piece Tönendes ABC, so one could "see the same forms that one was also hearing" — arguably the purest form of visualizing sound, though critics found the combinations "mechanical, almost soul-less."
1949
John & James Whitney — Visual Harmonies
As students, John and James Whitney created a series of music-and-image animations for the 1949 Experimental Film Competition in Belgium, using a single-score technique to synchronize sound and image. In their studio they linked a 16mm optical printer to an instrument built of pendulums via an optical wedge — a light valve that recorded the pendulums' motion as sound. At the heart of their work was the concept of visual harmonies: creating shapes and lines with light that correspond to musical harmonies using the temporality of film, searching for analogies in the natural world and in physics.
1963–1969
Nam June Paik — Television into the Gallery
Nam June Paik brought television into the gallery, creating multi-screen video sculptures that ran loops on dilapidated TV monitors. His 1969 TV Bra for Living Sculpture attached two small television screens, worn as a bra, to cellist Charlotte Moorman — as she played, the images modulated to the tones of the music, one of the earliest dynamic audiovisual pieces.
1968
Terry Riley — Music With Balls
Terry Riley's Music With Balls combined video, kinetic sculpture, and music. His prerecorded saxophone and organ tones were played back on speakers embedded in two large spinning black spheres with a silver pendulum in the middle. The circular motion of music and sculpture was filmed, transferred to video tape, and edited — a truly synaesthetic work dealing with the acoustics and vision of physical space combined with the motion of a kinetic structure.
1984–1986
Peter Gabriel & The Art of Noise
As music video matured, a few creators extended its language beyond simple band performance. Peter Gabriel's 1986 video for Sledgehammer used stop-motion animation to illustrate the text of the lyrics. The Art of Noise's 1984 video for Close (to the Edit) set Fischinger-style animation alongside live action to accompany the audio — some of the earliest instances of visual-music aesthetics entering the mainstream pop format.
Mid-1990s
Emergency Broadcast Network — The Video Sampler
EBN used sampled video footage from political propaganda to create lyrics and musical accompaniment, making work that was every bit as visual as it was auditory. Each musical sample had a video complement as its source. They later developed a first-of-its-kind video sampler that, triggered from a keyboard, displayed the companion video for each audio sample — a natural evolution toward the precise audiovisual timing the computer would eventually enable.
1997–2002
Michel Gondry — The Language of Visual Music
Gondry's 1997 video for Daft Punk's Around the World was an homage to Fischinger as interpretive dance: each group of dancers represented a specific instrument, descending staircases in synchrony with the melody line. His 2002 video for The Chemical Brothers' Star Guitar appeared to be a continuous shot from a passenger train, with catenary poles marking the rhythm track and retaining walls signifying the introduction or ending of a synth pad — a meticulously edited recombination based on a pre-composed visual score.
1970s–present
Video Games — Pong to Halo
Perhaps the best everyday examples of software emulating real-world perception are video games. Pong, among the first commercially successful video games, used sound as direct feedback: a successful paddle hit produced one bell tone, hitting a wall produced another, and a missed ball a higher bell still. Modern first-person games like Halo 3 place noisemakers in natural 3D space relative to the player's current position — though the monitor remains a two-dimensional surface displaying three dimensions with limited peripheral viewing.
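The positional principle is easy to demonstrate outside a game engine. The sketch below is a minimal illustration using the browser's Web Audio API (an illustrative stand-in, not Halo's actual spatializer): a sound source is placed in 3D space relative to a listener, and the audio system derives interaural cues so the ear can localize it.

```ts
// Placing a sound source in 3D space relative to the listener.
// Illustrative sketch only; a real game engine uses its own spatializer.

const ctx = new AudioContext();

// The listener stands at the origin (the "player position").
ctx.listener.positionX.value = 0;
ctx.listener.positionY.value = 0;
ctx.listener.positionZ.value = 0;

// A noisemaker two meters to the right of and one meter behind the player.
const panner = new PannerNode(ctx, {
  panningModel: "HRTF",     // head-related cues, so the ear can localize it
  distanceModel: "inverse", // loudness falls off with distance
  positionX: 2,
  positionY: 0,
  positionZ: 1,
});

const tone = new OscillatorNode(ctx, { frequency: 440 });
tone.connect(panner).connect(ctx.destination);
tone.start();
```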
2009
The Mixing Interface — A Problem Unsolved
The layout of a digital multitrack mixer closely resembles its analog counterpart from decades earlier. The long rectangular channel module harks back to the days when recording-console channels were physically removable and modular. Yet in the computer, a channel strip's position in the row has no relation to the sound's position in the mix: the audio on track one could just as easily live on track five, because there is no physical separation. The thesis asks: why, when MIDI programs evolved a natural piano-roll notation and recording windows evolved a natural time-amplitude display, did the mixing window remain a faithful — and unnecessary — copy of its analog ancestor?
2009
Toward a Natural Interface — This Thesis
The computer presents an excellent opportunity to combine sound and moving image in a repeatable, precise, and reliable way. Tools that already used the computer's native language — the MIDI piano roll, 3D spatial panners, touch-based interfaces like JazzMutant's Lemur, Golan Levin and Zach Lieberman's Manual Input Sessions, the reacTable project at Universitat Pompeu Fabra — all pointed toward the same conclusion: natural, physical metaphors grounded in how eyes and ears actually work together are more useful, and more creative, than inherited mechanical ones.
Hearing & Seeing as Complement
Both the eyes and the ears began as senses for orientation. The eyes developed in single-celled organisms as a way of detecting sunlight in the oceans; the ears developed from an orientation sense still retained in our inner ear. Though they evolved separately, both give the brain cues for how we perceive the world — and each can be fooled by the other.
The McGurk effect (McGurk and MacDonald, 1976) demonstrates this cross-influence directly: subjects shown video of a person mouthing "ga-ga" while the audio track played "ba-ba" perceived neither, hearing "da-da" instead. With their eyes closed, they heard "ba-ba"; watching without sound, they saw "ga-ga." Together, the senses created a third reality.
When subjects are asked to judge the audio quality of an audiovisual stimulus, the video quality will contribute significantly to the subjectively perceived audio quality. — AES Paper, Beerends & De Caluwe, 1999
Cross-Modal Effects
If there are fundamental connections between light and sound waves, then effects that alter one sense should have natural analogs in the other. The following pairs show how audio mixing concepts can be grounded in observable physics; a short code sketch after the list renders several of the pairs as an audio-processing chain.
Pan / Localization
Position in stereo space corresponds directly to visual position along the horizontal axis — the same way we locate objects in physical space using both eyes and both ears.
Volume / Height
Height on the Y axis represents volume. "Up in the mix" is already idiomatic — visually, a louder sound rises like a balloon.
Reverb / Light Diffusion
Reverberation — many scattered reflections diffusing in space — works on the same principle as a soft-box diffusing light: the same source, scattered in many directions, softening harsh edges.
Compression / Density
Compression raises low signals and lowers high ones — squeezing dynamic range. Visually, a compressed object is denser: the same mass, visibly more compact. A squeezed beach ball contains the same air.
Phase / Light Interference
When two identical frequencies arrive half a cycle out of phase, they cancel out (phase cancellation). When coherent light passes through two pinholes and projects onto a screen, the same interference appears as alternating bright and dark bands.
Delay / Visual Afterimage
Audio delay is a repetition of sound at diminished volume. The Pulfrich effect produces the visual equivalent: a dimmed eye perceives a pendulum as lagging behind its true position — an afterimage in time.
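As a concrete illustration, the sketch below wires several of these pairs into a single Web Audio chain. The mapping functions and parameter values are illustrative assumptions, not a specification.

```ts
// Several of the cross-modal pairs above, expressed as a Web Audio chain.
// The mappings (xToPan, yToGain) and all values are illustrative assumptions.

const ctx = new AudioContext();

// Pan / Localization: horizontal position (0..1) -> stereo pan (-1..1).
const xToPan = (x: number): number => x * 2 - 1;
// Volume / Height: vertical position (0..1, bottom to top) -> gain (0..1).
const yToGain = (y: number): number => y;

const source = new OscillatorNode(ctx, { frequency: 220 });
const density = new DynamicsCompressorNode(ctx, { ratio: 8 }); // Compression / Density
const echo = new DelayNode(ctx, { delayTime: 0.3 });           // Delay / Afterimage
const echoLevel = new GainNode(ctx, { gain: 0.4 });            // echoes repeat quieter
const panner = new StereoPannerNode(ctx, { pan: xToPan(0.25) });
const level = new GainNode(ctx, { gain: yToGain(0.8) });

// Dry path, plus a quieter delayed copy mixed back in before the panner.
source.connect(density).connect(panner).connect(level).connect(ctx.destination);
density.connect(echo).connect(echoLevel).connect(panner);

source.start();
```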
Primary Output · 2007–2012
The first major output of this research was a software environment — and eventually a company — built directly from the cross-modal framework. Each sound is a shape in physical space. Each shape is a sound. Horizontal position controls stereo pan. Vertical position controls volume. Physics — gravity, spring repulsion, collision detection — govern movement. Every audio effect has a visual twin.
The thesis made real: ShapeMix raised $2M, launched four iOS apps, and earned a patent (US20110271186A1). The SPIN Magazine co-branded remixing contest reached an audience of tens of thousands. But more than the business outcome, the product proved the core proposition — that cross-modal interaction is not a novelty, but a more natural way to work with sound.
The interactive demo below is a functional prototype of the core ShapeMix interaction. Drag shapes — left↔right is stereo pan, up↕down is volume. Double-click to mute.
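In code, that interaction reduces to a pair of event handlers. The sketch below assumes each shape owns a StereoPannerNode and a GainNode; the names and normalization are illustrative assumptions, not the shipped ShapeMix source.

```ts
// The core drag-to-mix interaction. Each shape owns its own audio nodes.
// Illustrative sketch; names and ranges are assumptions, not ShapeMix code.

interface Shape {
  panner: StereoPannerNode;
  gain: GainNode;
  muted: boolean;
  lastGain: number;
}

// Called while a shape is dragged; x and y are normalized to 0..1,
// with y measured from the top of the view (screen convention).
function onDrag(shape: Shape, x: number, y: number): void {
  shape.panner.pan.value = x * 2 - 1; // left-right -> stereo pan
  if (!shape.muted) {
    shape.gain.gain.value = 1 - y;    // higher on screen -> louder
  }
}

// Double-click toggles mute, restoring the previous level on unmute.
function onDoubleClick(shape: Shape): void {
  if (shape.muted) {
    shape.gain.gain.value = shape.lastGain;
  } else {
    shape.lastGain = shape.gain.gain.value;
    shape.gain.gain.value = 0;
  }
  shape.muted = !shape.muted;
}
```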
What's Next
The original research proposed that natural, physical environments can pave the way for more useful — and more creative — computing. That proposition has only become more urgent. The interfaces we build today are still largely divorced from how perception actually works: we read, we click, we type. The body is mostly absent.
Two threads are currently active. The first is a VR rebuild of ShapeMix on Meta Quest using hand tracking — removing the screen entirely and placing sound objects directly in three-dimensional space, where the cross-modal mappings become literal rather than metaphorical. The second is the question of what cross-modal interaction means when one of the parties is an AI system: what does it mean for a non-human collaborator to share a perceptual space with a human one?
These are not separate questions. They converge on the same problem that has driven this work since 2009: we build tools around how computers work. We should build them around how people perceive.
There can be no completely intimate visible and audible music until audiovisual unison is achieved. — Ralph K. Potter