VINO

Image Generation

Instruction-Based Image Editing

Hover to Edit: Move your mouse over the images to execute the editing instructions.

Video Generation

A macro shot of a man in an antique scuba helmet with dark glass lenses, walking out of a colorful flower bed. The man's weathered face and rugged hands are clearly visible through the helmet. His posture is slightly stooped, and he appears to be in deep concentration. The flower bed is filled with a variety of blooming flowers, their petals soft and vibrant, creating a lush and vivid backdrop. The camera angle is from below, capturing the man's entire figure as he emerges from the flowers, with the petals gently falling around him. The image has a vintage, almost nostalgic quality, with a focus on the intricate details of both the man and the flowers. A macro shot with a slightly downward angle.

A dynamic rally car speeding through a dense forest track, the wheels spinning in the muddy terrain. The car is sleek and powerful, with its hood slightly lifted due to the speed. The driver, a young man with focused intensity, grips the steering wheel tightly. His face is partially obscured by his helmet, but his eyes gleam with determination. The forest around him is lush and green, with trees towering overhead and sunlight filtering through the canopy, casting dappled shadows. Mud splashes up from the tires, creating a chaotic yet exhilarating scene. The camera angle is low, emphasizing the speed and energy of the car. The background features the rugged forest, with fallen logs and underbrush adding to the natural environment. The photo has a high-resolution, sharp texture, capturing every detail of the car and the surroundings. A low-angle shot highlighting the car's motion and the driver's intensity.

A dramatic and dynamic scene in the style of a disaster movie, depicting a powerful tsunami rushing through a narrow alley in Bulgaria. The water is turbulent and chaotic, with waves crashing violently against the walls and buildings on either side. The alley is lined with old, weathered houses, their facades partially submerged and splintered. The camera angle is low, capturing the full force of the tsunami as it surges forward, creating a sense of urgency and danger. People can be seen running frantically, adding to the chaos. The background features a distant horizon, hinting at the larger scale of the tsunami. A dynamic, sweeping shot from a low-angle perspective, emphasizing the movement and intensity of the event.

A vibrant and festive scene in Antarctica, where a toy robot dressed in a bright green dress and adorned with a sunny yellow sun hat takes a pleasant stroll. The robot's movements are graceful and lively, its arms swinging naturally as it explores the icy landscape. The background features a colorful and lively atmosphere, with various decorations and people in festive attire. The setting sun casts a warm glow, creating a magical and enchanting environment. The photo has a playful and whimsical style, capturing the essence of a joyful celebration. A medium shot with the robot walking towards the viewer, taken from a slightly elevated angle.

A dynamic racing scene captured in the style of a high-speed action shot, featuring a powerful horse galloping out of the starting gate at the beginning of a race. The horse's mane flows freely behind it, and its hooves kick up dust as it accelerates. The jockey, dressed in traditional racing gear, holds the reins tightly and gazes determinedly ahead. The background shows blurred spectators and a distant racetrack, with the sun casting golden rays through the haze. The horse's muscles ripple with exertion, and its eyes are fixed on the finish line. A close-up from a low-angle perspective, emphasizing the horse's motion and the intensity of the moment.

A dynamic high-speed train speeding out of a bustling train station, accelerating rapidly and soon reaching its top speed. The train glides smoothly along the tracks, leaving behind a blur of motion as it cuts through the air. The station platform is crowded with people waving goodbye, their faces captured in various expressions of excitement and farewell. The train’s windows reflect the bright morning sunlight, creating a sense of speed and energy. The background features a modern cityscape with tall buildings and busy streets, hinting at the fast-paced urban life. The camera angle is from the front of the train, capturing the motion and momentum as it zooms ahead.

A romantic, dreamy landscape photograph set within a tranquil garden, capturing an ancient fountain that gently trickles with water. Surrounding the fountain, vibrant flowers in shades of pink, purple, and yellow bloom profusely, their petals fluttering slightly in the gentle breeze. Lush greenery, including tall ferns and dense foliage, creates a lush, verdant backdrop that seems to whisper secrets of the past. The camera angle is slightly elevated, offering a broad view of the entire scene, with the fountain at the center and the flowers and greenery framing it beautifully. The photo has a soft, ethereal quality with subtle shadows and highlights.

A drone flies rapidly through a rugged, natural landscape filled with dense forests, towering mountains, and cascading waterfalls. The drone captures sweeping aerial views of lush greenery, rocky cliffs, and winding rivers. The terrain is varied and challenging, showcasing the drone's agility as it navigates through narrow valleys and over expansive meadows. The footage transitions smoothly between close-ups of flora and wide panoramic shots, emphasizing the vastness and beauty of the wilderness. The camera maintains a steady tracking shot, following the drone as it speeds through the breathtaking scenery.

A gray cat standing on an office table, waving one of its legs in the air playfully. The cat has sleek fur and bright green eyes, with a curious expression. The office table is cluttered with typical office supplies such as a laptop, notebooks, and a cup of coffee. The background includes other office furniture like a desk chair and bookshelves. Medium close-up shot focusing on the cat's face and waving leg, with the office environment visible in the background.

A highly detailed, cinematic scene featuring a young Steve Jobs in his early days at Apple Inc., highlighting the birth of innovation. Steve Jobs is depicted as a passionate and visionary entrepreneur, standing confidently in a modern office space filled with vintage computers and prototypes. He is dressed in his iconic jeans and black turtleneck, with a determined yet approachable expression. The room is bustling with activity, with colleagues gathered around him, discussing ideas and working on projects. The scene captures the energy and excitement of the pioneering era of personal computing, emphasizing the collaborative spirit and groundbreaking mindset of the team. The camera remains static, focusing on the interactions and expressions that convey the spirit of innovation. Wide shot.

A vibrant Brazilian street scene at night, featuring a sudden flash from a camera or fireworks. The scene is bustling with lively people dancing Samba, colorful lights from street lamps and neon signs reflecting off wet cobblestone streets. People are dressed in bright costumes and casual summer wear, laughing and enjoying themselves. In the background, there are tall buildings and lush palm trees swaying gently in the breeze. The atmosphere is filled with excitement and joy, capturing the essence of Brazilian nightlife. Medium shot, static scene.

A Roman philosopher, dressed in traditional togas and sandals, walks gracefully through ancient Rome at dusk. The philosopher has a thoughtful expression, holding a scroll and a staff, as he strolls along the cobblestone streets. Golden hour sunlight bathes everything in warm, glowing hues, casting soft shadows and highlighting the intricate architecture of the ancient city. The background includes bustling crowds, market stalls, and iconic landmarks such as the Colosseum and the Forum. The scene is captured in a medium shot, emphasizing the serene and reflective mood of the philosopher.

Multi-Image Conditioned Generation

Generate Image 1 walking on a forest path, reaching out to touch leaves while the camera follows from the side.

Generate Image 1 cutting fruit in a kitchen, filmed from the side.

Generate Image 1 nodding gently and swaying slightly to the rhythm while listening to music.

Generate Image 1 leaning on the railing, gazing at the river while the wind moves the edge of their coat.

Generate Image 1 holding the bus handrail and swaying slightly with the vehicle's motion.

Make Image 1 weave through the forest at speed, with sunlight flickering through the trees

Make Image 1 sit on a glowing sled at high speed through a cyberpunk night city, neon reflections shimmering on the snow as the camera tracks the motion.

Make Image_1 run through a steampunk factory with scattered metal parts and subtle mech-enhancements.

Make Image_1 push a round stone slab, causing it to slide slowly across the ground.

Make the person in Image_1 wearing the dress in Image_2 turn back and look at the camera.

Make the person in Image_1 wearing the dress in Image_2 turn back and look at the camera.

Make the person in Image_1 wearing the vest in Image_2 raise their arms for a gentle stretch indoors.

Make the person in Image_1 wearing the vest in Image_2 raise their arms for a gentle stretch indoors.

Image_1 lifts Image_2 to inspect the color of the liquid and gently swirls it.

Make the person in Image_1 wear the headphones in Image_2 and hold the ear cups lightly to listen closer.

Make the person in Image_1 wear the headphones in Image_2 and nod gently to the rhythm.

Make the person in Image_1 wearing the headphones in Image_2 nod rhythmically with comic motion lines.

Make the person in Image_1 hold the cosmetics in Image_2 and observe how the case reflects light.

Make the person in Image_1 hold the object in Image_3 while wearing the clothing in Image_2 and briefly show it to the camera.

Make the person in Image_1 hold the object in Image_3 while wearing the clothing in Image_2 and briefly show it to the camera.

Image_1, wearing Image_2, hold Image_3 and briefly show it to the camera.

Image_1, wearing Image_2, hold Image_3 and briefly show it to the camera.

Image_1 and Image_2 sit on a park bench, passing a small sketchbook between them. One draws while the other watches closely, smiling with gentle admiration.

Image_1 walks down a shaded alley, swinging Image_2 lightly at their side while tightening the strap of Image_3 over their shoulder, their movements steady and relaxed.

Instruction-Based Video Editing

Edit videos through text instructions.

Transform the scene into a hand-drawn anime cel-shaded aesthetic with bold outlines, vibrant gradients, and exaggerated atmospheric depth, as if from a fantasy film.

Transform the interaction into a stylized comic book panel with bold outlines, speech bubbles, and exaggerated motion lines that emphasize the gesture and shared focus.

Render the entire video in the style of a 3D-animated stop-motion diorama, where the boombox is a meticulously crafted miniature, and the people are tiny papercraft figures in a cozy, dollhouse-sized room.

Replace the large hoop earrings with glowing crystal earrings that pulse with the room’s lighting.

Change the large glossy green leaves to translucent, glowing leaves with internal veins of light.

Replace the mustard-yellow knit beanie with a deep burgundy wool one with a pom-pom

Add a small, decorative ceramic jar filled with rose petals on the nightstand beside the bed.

Transform the scene into a surreal, surrealistic still life where the brushes are giant, living entities with personalities, and the hand is a miniature sculptor crafting masterpieces from light.

Style the video as a 3D animated fantasy where the gift boxes are portals to different worlds, and the girl is a chosen hero opening enchanted doors.

Infuse with the visual elements of the 3D Chibi style.

Render the entire video in a high-gloss, luxury fashion editorial style with dramatic lighting, soft shadows, and a diamond-dust texture overlay on the dancer’s leotard.

Let it be like the Ghibli style.

Replace the red jackets with deep indigo hooded coats with reflective zippers.

Change the fluffy white dog to a sleek silver-haired poodle with a red collar.

Render the entire video in a high-definition, surreal glass sculpture aesthetic, where every object and person appears as if carved from transparent, iridescent crystal with internal light reflections.

Replace the black long-sleeved top with a translucent gray hoodie that reveals faint outlines of the body.

Change the curly-haired individual’s hair to a flowing, silver-white cascade with faint sparkles.

Replace the sunscreen sun-shaped design on the back with a tattoo of a tropical bird.

Replace the metallic structure of the bridge with a glass and steel arch that reflects the sky and ice.

Alter the woman's hair from loose waves to a neat low bun.

Image-Referenced Video Editing

Edit videos by providing a reference image.

Put the hat from the image on the man in the video

Let the anime characters in the image sleep on the green space in the video

Put the cowboy hat from the image on the man wearing white clothes in the video

Replace the man in the video with the magical woman in the image

Let the man with mask in the video wear the mask in the image

Change the women's clothes in the video into those in the image

Replace the woman in the video with the female character in the image

Replace the woman at the back of the video with the game character in the image

Transform the figures of the two girls in the video into the robust figures of the girls in the image

Replace the black sunglasses on the woman's head in the video with the glasses in the image

Replace the clothes of the woman in the middle of the video with those in the reference image

Replace the pumpkins in the video with Halloween pumpkins in the image, and replace the surrounding environment with Halloween atmosphere

Replace the clothes of the woman in the video with the red assault suit shown in the image

Replace the woman in the video with the movie character in the image

Replace the scene style in the video with the anime style in the image

Replace the white chicken in the video with the cartoon style in the image, but keep the scene unchanged

Transform the guitar played by the girl in the video into the cartoon guitar in the image

Turn the Rice and vegetable roll eaten by children in the video into steamed buns in the picture

Replace the white backpack in the video with the gray backpack in the image

Put the gold chain from the image on the man in the video

Put helmets on everyone in the video as shown in the image

Put the watch from the image on the wrist closest to the camera in the video

Change the video style to the post apocalyptic style in the image, and a tsunami appears in the distance

Replace the house in the video with the temple in the image

Video Generation Driven by Reference Video

Generate videos by providing a reference video (motion, expression, or camera cloning).

Based on the camera motion in the video, transfer that effect to this image to animate it.

Based on the camera motion in the video, transfer that effect to this image to animate it.

Based on the camera motion in the video, transfer that effect to this image to animate it.

Based on the camera motion in the video, transfer that effect to this image to animate it.

Refer the video's camera movements, apply those effects to this image.

Refer the video's camera movements, apply those effects to this image.

Refer the video's camera movements, apply those effects to this image.

Refer the video's camera movements, apply those effects to this image.

Refer the video's camera movements, apply those effects to this image.

Refer the video's camera movements, apply those effects to this image.

Create a video from the image that replicates the motion in the video.

Create a video from the image that replicates the motion in the video.

Refer the video's camera movements, apply those effects to this image.

Refer the video's camera movements, apply those effects to this image.

Create a video from the image that replicates the motion in the video.

Create a video from the image that replicates the motion in the video.

Create a video from the image that replicates the motion in the video.

Create a video from the image that replicates the motion in the video.

Create a video from the image that replicates the motion in the video.

Create a video from the image that replicates the motion in the video.

Create a video from the image that replicates the motion in the video.

Create a video from the image that replicates the motion in the video.

Create a video from the image that replicates the motion in the video.

Create a video from the image that replicates the motion in the video.

Create a video from the image that replicates the motion in the video.

Create a video from the image that replicates the motion in the video.

BibTeX

@article{chen2026vino,
    title={VINO: A Unified Visual Generator with Interleaved OmniModal Context},
    author={Chen, Junyi and He, Tong and Fu, Zhoujie and Wan, Pengfei and Gai, Kun and Ye, Weicai},
    journal={arXiv preprint arXiv:2601.02358},
    year={2026}
}