Sagie Benaim

I am an Assistant Professor (Senior Lecturer) at the School of Computer Science and Engineering at the Hebrew University of Jerusalem. Previously, I was a postdoc at DIKU working with Prof. Serge Belongie and a member of the Pioneer Center for AI. Before that, I was a PhD candidate at Tel Aviv University, working in the Deep Learning Lab under the supervision of Prof. Lior Wolf.

My research interests are in computer vision, machine learning and computer graphics. In particular, I am interested in generative models, neural-based signal representations and inverse graphics. I am also interested in how disentangled and generative models can be used to better understand the visual world.

I am recruiting M.Sc. and Ph.D. students. If you are interested in working with me, please contact me!

Email  /  CV  /  Github  /  LinkedIn  /  Google Scholar  /  Twitter

Research

Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
Guy Yariv, Itai Gat, Sagie Benaim, Lior Wolf, Idan Schwartz, Yossi Adi.
AAAI, 2024. project page / arXiv / code

We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by a pretrained text-to-video generation model.



Volumetric Disentanglement for 3D Scene Manipulation
Sagie Benaim, Frederik Warburg, Peter Ebert Christensen, Serge Belongie
WACV, 2024. project page / arXiv

We propose a framework for disentangling a 3D scene into foreground and background volumetric representations and show a variety of downstream applications involving 3D manipulation.



Assessing Neural Network Robustness via Adversarial Pivotal Tuning
Peter Ebert Christensen, Vésteinn Snæbjarnarson, Andrea Dittadi, Serge Belongie, Sagie Benaim
WACV, 2024.   (Oral Presentation)
project page / arXiv

We propose a framework for generation of photorealistic images that can fool a classifier using automatic semantic manipulations.


Reward Finetuning for Faster and More Accurate Unsupervised Object Discovery
Katie Z Luo, Zhenzhen Liu, Xiangyu Chen, Yurong You, Sagie Benaim, Cheng Perng Phoo, Mark Campbell, Wen Sun, Bharath Hariharan, Kilian Q Weinberger.
NeurIPS, 2023. arXiv

We propose to adapt RL-based methods similar to Reinforcement Learning from Human Feedback (RLHF) to the task of unsupervised object discovery, i.e., learning to detect objects from LiDAR points without any training labels.



Discriminative Class Tokens for Text-to-Image Diffusion Models
Idan Schwartz*, Vésteinn Snæbjarnarson*, Hila Chefer, Ryan Cotterell, Serge Belongie, Lior Wolf, Sagie Benaim.
*Equal contribution. ICCV, 2023. project page / arXiv / code

We propose a non-invasive fine-tuning technique for Text-to-Image Diffusion Models that capitalizes on the expressive potential of free-form text while achieving high accuracy through discriminative signals from a pretrained classifier, which guides the generation. This is done by iteratively modifying the embedding of a single input token of a text-to-image diffusion model, using the classifier to steer generated images toward a given target class.


Polynomial Neural Fields for Subband Decomposition
Guandao Yang*, Sagie Benaim*, Varun Jampani, Kyle Genova, Jonathan T. Barron, Thomas Funkhouser, Bharath Hariharan, Serge Belongie. *Equal contribution.
NeurIPS, 2022. arXiv

We propose a new class of neural fields called basis-encoded polynomial neural fields (PNFs). The key advantage of a PNF is that it can represent a signal as a composition of a number of manipulable and interpretable components without losing the merits of neural fields representation.

Image-Based CLIP-Guided Essence Transfer
Hila Chefer, Sagie Benaim, Roni Paiss, Lior Wolf
ECCV, 2022. arXiv / code / 5 minute summary

A new style (essence) transfer method that incorporates higher-level abstractions than textures and colors. TargetCLIP introduces a blending operator that combines the powerful StyleGAN2 generator with a semantic network, CLIP, to achieve a more natural blending than with each model separately.



Text-Driven Stylization of Video Objects
Sebastian Loeschcke, Serge Belongie, Sagie Benaim
ECCV Workshop on AI for Creative Video Editing and Understanding, 2022.   (Best Paper Award)
arXiv / project page

A method for stylizing video objects in an intuitive and semantic manner following a user-specified text prompt.

Text2Mesh: Text-Driven Neural Stylization for Meshes
Oscar Michel*, Roi Bar-On*, Richard Liu*, Sagie Benaim, Rana Hanocka.
CVPR, 2022.   (Oral Presentation)
project page / arXiv / code

Text2Mesh produces color and geometric details over a variety of source meshes, driven by a target text prompt. Our stylization results coherently blend unique and ostensibly unrelated combinations of text, capturing both global semantics and part-aware attributes.

JOKR: Joint Keypoint Representation for Unsupervised Cross-Domain Motion Retargeting
Ron Mokady, Rotem Tzaban, Sagie Benaim, Amit Bermano, Daniel Cohen-Or
Computer Graphics Forum, 2022. project page / arXiv / code

We introduce JOKR - a JOint Keypoint Representation that captures the motion common to both the source and target videos, without requiring any object prior or data collection. This geometry-driven representation allows for unsupervised motion retargeting in a variety of challenging situations as well as for further intuitive control, such as temporal coherence and manual editing.



FewGAN: Generating from the Joint Distribution of a Few Images
Lior Ben Moshe, Sagie Benaim, Lior Wolf
ICIP, 2022. arXiv

FewGAN is a generative model for generating novel, high-quality and diverse images whose patch distribution lies in the joint patch distribution of a small number of N training samples.


Locally Shifted Attention With Early Global Integration
Shelly Sheynin, Sagie Benaim, Adam Polyak, Lior Wolf
arXiv, 2021. arXiv / code

A new image transformer architecture which first applies a local attention over patches and their local shifts, resulting in virtually located local patches, which are not bound to a single, specific location. Subsequently, these virtually located patches are used in a global attention layer.

A Hierarchical Transformation-Discriminating Generative Model for Few Shot Anomaly Detection
Shelly Sheynin*, Sagie Benaim*, Lior Wolf. *Equal Contribution.
ICCV, 2021. project page / arXiv / code

We consider the setting of few-shot anomaly detection in images, where only a few images are given at training. We devise a hierarchical generative model that captures the multi-scale patch distribution of each training image. We further enhance the representation of our model by using image transformations and optimize scale-specific patch-discriminators to distinguish between real and fake patches of the image, as well as between different transformations applied to those patches.

Permuted AdaIN: Reducing the Bias Towards Global Statistics in Image Classification
Oren Nuriel, Sagie Benaim, Lior Wolf
CVPR, 2021. arXiv / code

A simple architectural change which forces the network to reduce its bias to global image statistics. Using AdaIN, we stochastically swap the global statistics of samples within a batch, with some probability p. This results in significant improvements in multiple settings including domain adaptation, domain generalization, robustness and image classification.



Identity and Attribute Preserving Thumbnail Upscaling
Noam Gat, Sagie Benaim, Lior Wolf
ICIP, 2021. arXiv / code

StyleGAN can be used to upscale a low-resolution thumbnail image of a person to a higher-resolution image. However, it often changes the person’s identity, or produces biased solutions, such as Caucasian faces. We present a method to upscale an image that preserves the person's identity and other attributes.


Risk Bounds for Unsupervised Cross-Domain Mapping with IPMs
Tomer Galanti, Sagie Benaim, Lior Wolf
Journal of Machine Learning Research (JMLR), 2021. arXiv

We develop theoretical foundations for the success of unsupervised cross-domain mapping algorithms, in mapping between two domains that share common characteristics, with a particular emphasis on the clear ambiguity in such mappings.


Evaluation Metrics for Conditional Image Generation
Yaniv Benny, Tomer Galanti, Sagie Benaim, Lior Wolf
International Journal of Computer Vision (IJCV), 2020. arXiv

Two new metrics for evaluating generative models in the class-conditional image generation setting. These metrics are obtained by generalizing the two most popular unconditional metrics: the Inception Score (IS) and the Fréchet Inception Distance (FID).

Structural-analogy from a Single Image Pair
Sagie Benaim*, Ron Mokady*, Amit Bermano, Daniel Cohen-Or, Lior Wolf. *Equal contribution.
Computer Graphics Forum, 2020.
Also in the Deep Internal Learning workshop, ECCV 2020.
project page / arXiv / code / video

We explore the capabilities of neural networks to understand image structure given only a single pair of images, A and B. We seek to generate images that are structurally aligned: that is, to generate an image that keeps the appearance and style of B, but has a structural arrangement that corresponds to A. Our method can be used for: guided image synthesis, style and texture transfer, text translation as well as video translation.


Hierarchical Patch VAE-GAN: Generating Diverse Videos from a Single Sample
Shir Gur*, Sagie Benaim*, Lior Wolf. *Equal contribution.
NeurIPS, 2020.
Also in the Deep Internal Learning workshop, ECCV 2020.
project page / arXiv / code / video

We consider the task of generating diverse and novel videos from a single video sample. We introduce a novel patch-based variational autoencoder (VAE) which allows for a much greater diversity in generation. Using this tool, a new hierarchical video generation scheme is constructed resulting in diverse and high quality videos.


SpeedNet: Learning the Speediness in Videos
Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Michal Irani, Tali Dekel
CVPR, 2020.   (Oral Presentation)
project page / arXiv / video

We train a network called SpeedNet to automatically predict the "speediness" of moving objects in videos - whether they move faster, at, or slower than their "natural" speed. SpeedNet is trained in a self-supervised manner and can be used to generate time-varying, adaptive video speedups as well as to boost the performance of self-supervised action recognition and video retrieval.

Mask Based Unsupervised Content Transfer
Ron Mokady, Sagie Benaim, Amit Bermano, Lior Wolf
ICLR, 2020.  
arXiv / code / video

We consider the problem of translating, in an unsupervised manner, between two domains where one contains some additional information compared to the other. To do so, we disentangle the common and separate parts of these domains and, through the generation of a mask, focus the attention of the underlying network on the desired augmentation, without wastefully reconstructing the entire target.

Domain Intersection and Domain Difference
Sagie Benaim, Michael Khaitov, Tomer Galanti, Lior Wolf
ICCV, 2019.
arXiv / code

We present a method for recovering the shared content between two visual domains as well as the content that is unique to each domain. This allows us to remove content specific to the first domain and add content specific to the second domain. We can also generate from the intersection of the two domains and their union, despite having no such samples during training.



Semi-Supervised Monaural Singing Voice Separation With a Masking Network Trained on Synthetic Mixtures
Michael Michelashvili, Sagie Benaim, Lior Wolf
ICASSP, 2019.
arXiv / code / samples

We study the problem of semi-supervised singing voice separation, in which the training data contains a set of samples of mixed music (singing and instrumental) and an unmatched set of instrumental music. Our results indicate that we are on a par with or better than fully supervised methods, which are also provided with training samples of unmixed singing voices, and are better than other recent semi-supervised methods.


Emerging Disentanglement in Auto-Encoder Based Unsupervised Image Content Transfer
Ori Press, Tomer Galanti, Sagie Benaim, Lior Wolf
ICLR, 2019.
arXiv / code

We study the problem of learning to map, in an unsupervised way, between domains A and B, such that the samples b in B contain all the information that exists in samples a in A, and some additional information.

Unsupervised Learning of the Set of Local Maxima
Lior Wolf, Sagie Benaim, Tomer Galanti
ICLR, 2019.
arXiv

We study a new form of unsupervised learning, whose input is a set of unlabeled points that are assumed to be local maxima of an unknown value function v in an unknown subset of the vector space. Two functions are learned: (i) a set indicator c, which is a binary classifier, and (ii) a comparator function h that, given two nearby samples, predicts which sample has the higher value of the unknown function v.

One-Shot Unsupervised Cross Domain Translation
Sagie Benaim, Lior Wolf
NeurIPS, 2018.
arXiv / code

Given a single image x from domain A and a set of images from domain B, we consider the task of generating the analog of x in B.


Estimating the Success of Unsupervised Image to Image Translation
Sagie Benaim*, Tomer Galanti*, Lior Wolf. *Equal contribution.
ECCV, 2018.
arXiv / code

While in supervised learning, the validation error is an unbiased estimator of the generalization (test) error and complexity-based generalization bounds are abundant, no such bounds exist for learning a mapping in an unsupervised way. We propose a novel bound for predicting the success of unsupervised cross domain mapping methods.

The Role of Minimal Complexity Functions in Unsupervised Learning of Semantic Mappings
Tomer Galanti, Sagie Benaim, Lior Wolf
ICLR, 2018.
arXiv

We discuss the feasibility of the unsupervised cross domain generation problem. In the typical setting this problem is ill-posed: it seems possible to build infinitely many alternative mappings from every target mapping. We identify the abstract notion of aligning two domains and show that only a minimal architecture and a standard GAN loss are required to learn such mappings, without the need for a cycle loss.



One-Sided Unsupervised Domain Mapping
Sagie Benaim, Lior Wolf
NIPS, 2017.   (Spotlight)
arXiv / code

We consider the problem of mapping, in an unsupervised manner, between two visual domains in a one-sided fashion. This is done by learning an equivariant mapping that maintains the distance between a pair of samples.

Complexity of Two-variable Logic on Finite Trees
Sagie Benaim, Michael Benedikt, Witold Charatonik, Emanuel Kieroński, Rastislav Lenhardt, Filip Mazowiecki, James Worrell
ICALP, 2013 and ACM Transactions on Computational Logic, Volume 17, 2016 (MSc Thesis).

This work contains a comprehensive analysis of the complexity of the two-variable fragment of first-order logic FO2 on trees.

Talks
Towards a Controllable Generation of the Multimodal World, ELLIS, AI4media, and AIDA Symposium on Large Language and Foundation Models, Amsterdam, 2023.

Towards a Controllable Generation of the 3D World, Hebrew University of Jerusalem, Tel Aviv University, Weizmann Institute of Science, Bar Ilan University, Technion Institute, Haifa University, Meta AI Research Tel Aviv, 2022-2023.

Text2Mesh: Text-Driven Stylization for Meshes, Israel Computer Vision Day 2021, Pre-CVPR Day, University of Copenhagen, 2022.

Semantic Manipulation of Visual Content, Pioneer Center of AI Colloquium, hosted by Aarhus University and Technion CDS Seminar, 2021.

Structure-Aware Manipulation of Images and Videos, 2021.
Facebook AI Research (London), Stanford SVL Meeting, Google Research (Tel Aviv),
Nvidia Research (San Francisco), Technion CDS Seminar, Tel Aviv Visual Computing Seminar.

Manipulating Structure in Images and Videos, Nvidia Research (Tel Aviv), Berkeley, 2021.

On disentangled and few shot visual generation and understanding, Google Viscam Seminar, 2020.

Learning the Speediness in Videos and Generating Novel Videos From a Single Sample, Hebrew University Vision Seminar, Technion ML Seminar, 2020.

SpeedNet: Learning the Speediness in Videos, Viz.ai, 2020.

Visual Analogies: The role of disentanglement and learning from few examples, Hebrew University Vision Seminar, 2020.

Domain Intersection and Domain Difference, 2020. Amazon Research (Tel Aviv), ICCVi, 2019.

Generative Adversarial Networks for Image to Image Translation, IMVC, 2019. Video.

New Capabilities in Unsupervised Image to Image Translation, Bar Ilan ML Seminar, 2019.

One-Shot Unsupervised Cross Domain Translation, Technion CDS Seminar, 2019.

Introduction to Generative Adversarial Networks, Elbit, 2018.

Generative Adversarial Networks for Image to Image Translation, Nexar, 2018.

One-Sided Unsupervised Domain Mapping, Hebrew University Vision Seminar, Weizmann Institute Vision Seminar, Technion Pixel Club, 2018.


This page design is based on a template by Jon Barron.