Sagie Benaim

I am a Postdoc at DIKU working with Prof. Serge Belongie and a member of the Pioneer Center for AI.

Previously, I was a PhD candidate at Tel Aviv University, working in the Deep Learning Lab under the supervision of Prof. Lior Wolf.

I work at the intersection of computer vision and machine learning. I am interested in computer vision for AR/VR, semi-supervised and self-supervised learning, and few-shot learning. Much of my work focuses on content creation, generative models, image-to-image translation, domain adaptation, and disentanglement. I am also interested in finding general inductive biases that can be used to learn from few examples and to reduce the level of supervision. I spent the summer of 2019 at Google Research working on self-supervised learning for videos.

Email  /  CV  /  Github  /  LinkedIn  /  Google Scholar  /  Twitter


Volumetric Disentanglement for 3D Scene Manipulation
Sagie Benaim, Frederik Warburg, Peter Ebert Christensen, Serge Belongie
arXiv, 2022. project page / arXiv

We propose a framework for disentangling a 3D scene into foreground and background volumetric representations, and show a variety of downstream applications involving 3D manipulation.

Image-Based CLIP-Guided Essence Transfer
Hila Chefer, Sagie Benaim, Roni Paiss, Lior Wolf
ECCV, 2022. arXiv / code / 5-minute summary

A new style (essence) transfer method that incorporates higher-level abstractions than textures and colors. TargetCLIP introduces a blending operator that combines the powerful StyleGAN2 generator with the semantic network CLIP to achieve a more natural blending than either model achieves separately.

Text-Driven Stylization of Video Objects
Sebastian Loeschcke, Serge Belongie, Sagie Benaim
ECCV Workshop on AI for Creative Video Editing and Understanding, 2022. arXiv / project page

A method for stylizing video objects in an intuitive and semantic manner following a user-specified text prompt.

Text2Mesh: Text-Driven Neural Stylization for Meshes
Oscar Michel*, Roi Bar-On*, Richard Liu*, Sagie Benaim, Rana Hanocka
CVPR, 2022.   (Oral Presentation)
project page / arXiv / code

Text2Mesh produces color and geometric details over a variety of source meshes, driven by a target text prompt. Our stylization results coherently blend unique and ostensibly unrelated combinations of text, capturing both global semantics and part-aware attributes.

JOKR: Joint Keypoint Representation for Unsupervised Cross-Domain Motion Retargeting
Ron Mokady, Rotem Tzaban, Sagie Benaim, Amit Bermano, Daniel Cohen-Or
Computer Graphics Forum, 2022. project page / arXiv / code

We introduce JOKR, a JOint Keypoint Representation that captures the motion common to both the source and target videos, without requiring any object prior or data collection. This geometry-driven representation allows for unsupervised motion retargeting in a variety of challenging situations, as well as for further intuitive control, such as temporal coherence and manual editing.

FewGAN: Generating from the Joint Distribution of a Few Images
Lior Ben Moshe, Sagie Benaim, Lior Wolf
ICIP, 2022. arXiv

FewGAN is a generative model for generating novel, high-quality and diverse images whose patch distribution lies in the joint patch distribution of a small number N of training samples.

Locally Shifted Attention With Early Global Integration
Shelly Sheynin, Sagie Benaim, Adam Polyak, Lior Wolf
arXiv, 2021. arXiv / code

A new image transformer architecture that first applies local attention over patches and their local shifts, resulting in virtually located local patches that are not bound to a single, specific location. Subsequently, these virtually located patches are used in a global attention layer.

A Hierarchical Transformation-Discriminating Generative Model for Few Shot Anomaly Detection
Shelly Sheynin*, Sagie Benaim*, Lior Wolf
ICCV, 2021. project page / arXiv / code

We consider the setting of few-shot anomaly detection in images, where only a few images are given at training time. We devise a hierarchical generative model that captures the multi-scale patch distribution of each training image. We further enhance the representation of our model by using image transformations and optimize scale-specific patch discriminators to distinguish between real and fake patches of the image, as well as between different transformations applied to those patches.

Permuted AdaIN: Reducing the Bias Towards Global Statistics in Image Classification
Oren Nuriel, Sagie Benaim, Lior Wolf
CVPR, 2021. arXiv / code

A simple architectural change that forces the network to reduce its bias towards global image statistics. Using AdaIN, we stochastically swap the global statistics of samples within a batch with some probability p. This results in significant improvements in multiple settings, including domain adaptation, domain generalization, robustness and image classification.
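
To make the statistics swap concrete, here is a minimal pure-Python sketch of the idea as summarized above. It assumes feature maps are given as nested lists (batch[b][c] holds the flattened activations of channel c of sample b) and that, with probability p, every sample is re-normalized with the per-channel mean and standard deviation of a partner drawn from a random permutation of the batch. The function name permuted_adain, the eps value, and the default p are illustrative assumptions, not the released code.

```python
import random
import statistics


def _mean_std(values, eps=1e-5):
    """Per-channel mean and standard deviation (eps avoids division by zero)."""
    mu = statistics.fmean(values)
    var = statistics.fmean([(v - mu) ** 2 for v in values])
    return mu, (var + eps) ** 0.5


def permuted_adain(batch, p=0.01, rng=random):
    """Hypothetical sketch: with probability p, swap global channel statistics
    between samples of the batch using AdaIN; otherwise return the batch as-is.

    batch[b][c] is a flat list of activations for sample b, channel c.
    """
    if rng.random() >= p:
        return batch  # identity branch, taken with probability 1 - p
    perm = list(range(len(batch)))
    rng.shuffle(perm)  # sample b borrows statistics from sample perm[b]
    out = []
    for b, sample in enumerate(batch):
        donor = batch[perm[b]]
        new_sample = []
        for channel, donor_channel in zip(sample, donor):
            mu, sigma = _mean_std(channel)
            mu_d, sigma_d = _mean_std(donor_channel)
            # AdaIN: normalize with own statistics, re-scale with the donor's.
            new_sample.append([sigma_d * (v - mu) / sigma + mu_d for v in channel])
        out.append(new_sample)
    return out
```

Because the permutation is a bijection, the set of channel means in the batch is preserved; only their assignment to samples changes, so the network cannot rely on global statistics to identify an image.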

Identity and Attribute Preserving Thumbnail Upscaling
Noam Gat, Sagie Benaim, Lior Wolf
ICIP, 2021. arXiv / code

StyleGAN can be used to upscale a low-resolution thumbnail image of a person to a higher-resolution image. However, it often changes the person's identity or produces biased solutions, such as Caucasian faces. We present a method to upscale an image that preserves the person's identity and other attributes.

Risk Bounds for Unsupervised Cross-Domain Mapping with IPMs
Tomer Galanti, Sagie Benaim, Lior Wolf
Journal of Machine Learning Research (JMLR), 2021. arXiv

We develop theoretical foundations for the success of unsupervised cross-domain mapping algorithms, in mapping between two domains that share common characteristics, with a particular emphasis on the clear ambiguity in such mappings.

Evaluation Metrics for Conditional Image Generation
Yaniv Beniv, Tomer Galanti, Sagie Benaim, Lior Wolf
International Journal of Computer Vision (IJCV), 2020. arXiv

Two new metrics for evaluating generative models in the class-conditional image generation setting. These metrics are obtained by generalizing the two most popular unconditional metrics: the Inception Score (IS) and the Fréchet Inception Distance (FID).
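
For reference, the unconditional Inception Score that these metrics generalize is IS = exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) comes from a pretrained classifier and p(y) is its average over generated samples. The sketch below computes only this standard unconditional score from a matrix of class probabilities; the paper's class-conditional variants are not reproduced, and the function name inception_score is our own.

```python
import math


def inception_score(probs):
    """Unconditional Inception Score from classifier outputs.

    probs[i][k] = p(y = k | x_i) for generated sample x_i (each row sums to 1).
    """
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal label distribution p(y), estimated by averaging over samples.
    marginal = [sum(row[k] for row in probs) / n for k in range(num_classes)]
    total_kl = 0.0
    for row in probs:
        # KL(p(y|x) || p(y)); classes with p(y|x) = 0 contribute nothing.
        total_kl += sum(p * math.log(p / q) for p, q in zip(row, marginal) if p > 0)
    return math.exp(total_kl / n)
```

Two sanity checks follow from the definition: identical predictions for every sample give a score of 1, and confident predictions spread evenly over K classes give a score of K.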

Structural-analogy from a Single Image Pair
Sagie Benaim*, Ron Mokady*, Amit Bermano, Daniel Cohen-Or, Lior Wolf
Computer Graphics Forum, 2020.
Also in the Deep Internal Learning workshop, ECCV 2020.
project page / arXiv / code / video

We explore the capabilities of neural networks to understand image structure given only a single pair of images, A and B. We seek to generate images that are structurally aligned: that is, to generate an image that keeps the appearance and style of B, but has a structural arrangement that corresponds to A. Our method can be used for: guided image synthesis, style and texture transfer, text translation as well as video translation.

Hierarchical Patch VAE-GAN: Generating Diverse Videos from a Single Sample
Shir Gur*, Sagie Benaim*, Lior Wolf
NeurIPS, 2020.
Also in the Deep Internal Learning workshop, ECCV 2020.
project page / arXiv / code / video

We consider the task of generating diverse and novel videos from a single video sample. We introduce a novel patch-based variational autoencoder (VAE) which allows for a much greater diversity in generation. Using this tool, a new hierarchical video generation scheme is constructed resulting in diverse and high quality videos.

SpeedNet: Learning the Speediness in Videos
Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Michal Irani, Tali Dekel
CVPR, 2020.   (Oral Presentation)
project page / arXiv / video

We train a network called SpeedNet to automatically predict the "speediness" of moving objects in videos: whether they move faster than, at, or slower than their "natural" speed. SpeedNet is trained in a self-supervised manner and can be used to generate time-varying, adaptive video speedups, as well as to boost the performance of self-supervised action recognition and video retrieval.
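
Self-supervision here means the labels come for free from the video itself. A minimal sketch of how such training examples could be generated: a clip is either taken at the original frame rate (label 0) or temporally subsampled so that it plays faster (label 1). The 2x speedup factor, the 50/50 sampling, and the helper name make_speed_example are illustrative assumptions, and frames can be any list of decoded frames.

```python
import random


def make_speed_example(frames, clip_len, speedup=2, rng=random):
    """Hypothetical helper: return (clip, label), label 1 if the clip is sped up.

    A sped-up clip keeps every `speedup`-th frame, so it covers more time
    in the same number of frames, which is what the classifier must detect.
    """
    sped_up = rng.random() < 0.5
    stride = speedup if sped_up else 1
    span = clip_len * stride  # number of source frames the clip covers
    if len(frames) < span:
        raise ValueError("video too short for the requested clip")
    start = rng.randrange(len(frames) - span + 1)
    clip = frames[start:start + span:stride]
    return clip, int(sped_up)
```

A binary classifier trained on such pairs never sees a human label; its per-segment predictions can then be read as a speediness estimate.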

Mask Based Unsupervised Content Transfer
Ron Mokady, Sagie Benaim, Amit Bermano, Lior Wolf
ICLR, 2020.  
arXiv / code / video

We consider the problem of translating, in an unsupervised manner, between two domains where one contains some additional information compared to the other. To do so, we disentangle the common and separate parts of these domains and, through the generation of a mask, focus the attention of the underlying network on the desired augmentation, without wastefully reconstructing the entire target.
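
As a rough illustration of why a mask avoids reconstructing the entire target, the generic compositing step below assumes the network outputs generated content together with a soft mask in [0, 1]: output = m * generated + (1 - m) * source, applied per pixel, so pixels outside the mask are copied from the input. This is a sketch of mask-based blending in general, with flattened images as lists and a hypothetical masked_blend helper, not the paper's architecture.

```python
def masked_blend(source, generated, mask):
    """Blend flattened images pixel by pixel with a soft mask in [0, 1].

    Where mask == 0 the source pixel is kept unchanged; where mask == 1 the
    generated pixel replaces it; intermediate values mix the two linearly.
    """
    if not (len(source) == len(generated) == len(mask)):
        raise ValueError("source, generated and mask must have the same length")
    return [m * g + (1.0 - m) * s for s, g, m in zip(source, generated, mask)]
```

For example, masked_blend([1, 2, 3], [9, 9, 9], [0, 1, 0.5]) keeps the first pixel, replaces the second, and averages the third, giving [1.0, 9.0, 6.0].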

Domain Intersection and Domain Difference
Sagie Benaim, Michael Khaitov, Tomer Galanti, Lior Wolf
ICCV, 2019.
arXiv / code

We present a method for recovering the shared content between two visual domains as well as the content that is unique to each domain. This allows us to remove content specific to the first domain and add content specific to the second domain. We can also generate from the intersection of the two domains and their union, despite having no such samples during training.

Semi-Supervised Monaural Singing Voice Separation With a Masking Network Trained on Synthetic Mixtures
Michael Michelashvili, Sagie Benaim, Lior Wolf
ICASSP, 2019.
arXiv / code / samples

We study the problem of semi-supervised singing voice separation, in which the training data contains a set of samples of mixed music (singing and instrumental) and an unmatched set of instrumental music. Our results indicate that our method is on par with or better than fully supervised methods, which are also provided with training samples of unmixed singing voices, and is better than other recent semi-supervised methods.

Emerging Disentanglement in Auto-Encoder Based Unsupervised Image Content Transfer
Ori Press, Tomer Galanti, Sagie Benaim, Lior Wolf
ICLR, 2019.
arXiv / code

We study the problem of learning to map, in an unsupervised way, between domains A and B, such that the samples b in B contain all the information that exists in samples a in A, plus some additional information.

Unsupervised Learning of the Set of Local Maxima
Lior Wolf, Sagie Benaim, Tomer Galanti
ICLR, 2019.

We study a new form of unsupervised learning, whose input is a set of unlabeled points that are assumed to be local maxima of an unknown value function v in an unknown subset of the vector space. Two functions are learned: (i) a set indicator c, which is a binary classifier, and (ii) a comparator function h that given two nearby samples, predicts which sample has the higher value of the unknown function v.

One-Shot Unsupervised Cross Domain Translation
Sagie Benaim, Lior Wolf
NeurIPS, 2018.
arXiv / code

Given a single image x from domain A and a set of images from domain B, we consider the task of generating the analog of x in B.

Estimating the Success of Unsupervised Image to Image Translation
Sagie Benaim*, Tomer Galanti*, Lior Wolf
ECCV, 2018.
arXiv / code

While in supervised learning, the validation error is an unbiased estimator of the generalization (test) error and complexity-based generalization bounds are abundant, no such bounds exist for learning a mapping in an unsupervised way. We propose a novel bound for predicting the success of unsupervised cross domain mapping methods.

The Role of Minimal Complexity Functions in Unsupervised Learning of Semantic Mappings
Tomer Galanti, Sagie Benaim, Lior Wolf
ICLR, 2018.

We discuss the feasibility of the unsupervised cross-domain generation problem. In the typical setting this problem is ill-posed: it seems possible to build infinitely many alternative mappings from every target mapping. We identify the abstract notion of aligning two domains and show that only a minimal architecture and a standard GAN loss are required to learn such mappings, without the need for a cycle loss.

One-Sided Unsupervised Domain Mapping
Sagie Benaim, Lior Wolf
NIPS, 2017.   (Spotlight)
arXiv / code

We consider the problem of mapping, in an unsupervised manner, between two visual domains in a one-sided fashion. This is done by learning an equivariant mapping that maintains the distance between a pair of samples.
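
One way to express "maintains the distance between a pair of samples" as a training signal is a pairwise loss that compares distances before and after the mapping. The minimal sketch below assumes L1 distances standardized with per-domain mean and standard deviation (mu_a, sd_a, mu_b, sd_b) that the caller estimates beforehand, so that the two domains need not share a scale; the hypothetical distance_loss would be combined with a standard GAN loss in practice.

```python
from itertools import combinations


def l1(u, v):
    """L1 distance between two flattened samples."""
    return sum(abs(a - b) for a, b in zip(u, v))


def distance_loss(xs, gxs, mu_a, sd_a, mu_b, sd_b):
    """Mean absolute mismatch of standardized pairwise distances.

    xs are samples from domain A and gxs = [G(x) for x in xs] their mappings;
    the loss is zero when G preserves standardized pairwise distances.
    """
    pairs = list(combinations(range(len(xs)), 2))
    total = 0.0
    for i, j in pairs:
        d_a = (l1(xs[i], xs[j]) - mu_a) / sd_a
        d_b = (l1(gxs[i], gxs[j]) - mu_b) / sd_b
        total += abs(d_a - d_b)
    return total / len(pairs)
```

Because only pairs within one domain are compared, the constraint needs no inverse mapping, which is what allows the mapping to be learned in one direction only.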

Complexity of Two-Variable Logic on Finite Trees
Sagie Benaim, Michael Benedikt, Witold Charatonik, Emanuel Kieroński, Rastislav Lenhardt, Filip Mazowiecki, James Worrell
ICALP, 2013 and ACM Transactions on Computational Logic, Volume 17, 2016 (MSc Thesis).

This work contains a comprehensive analysis of the complexity of the two-variable fragment of first-order logic, FO2, on trees.

Text2Mesh: Text-Driven Stylization for Meshes, Israel Computer Vision Day 2021.

Semantic Manipulation of Visual Content, Pioneer Center of AI Colloquium, hosted by Aarhus University and Technion CDS Seminar, 2021.

Structure-Aware Manipulation of Images and Videos, 2021.
Facebook AI Research (London), Stanford SVL Meeting, Google Research (Tel Aviv),
Nvidia Research (San Francisco), Technion CDS Seminar, Tel Aviv Visual Computing Seminar.

Manipulating Structure in Images and Videos, Nvidia Research (Tel Aviv), Berkeley, 2021.

On disentangled and few-shot visual generation and understanding, Google Viscam Seminar, 2020.

Learning the Speediness in Videos and Generating Novel Videos From a Single Sample, Hebrew University Vision Seminar, Technion ML Seminar, 2020.

SpeedNet: Learning the Speediness in Videos, 2020.

Visual Analogies: The role of disentanglement and learning from few examples, Hebrew University Vision Seminar, 2020.

Domain Intersection and Domain Difference, 2020. Amazon Research (Tel Aviv), ICCV, 2019.

Generative Adversarial Networks for Image to Image Translation, IMVC, 2019. Video.

New Capabilities in Unsupervised Image to Image Translation, Bar Ilan ML Seminar, 2019.

One-Shot Unsupervised Cross Domain Translation, Technion CDS Seminar, 2019.

Introduction to Generative Adversarial Networks, Elbit, 2018.

Generative Adversarial Networks for Image to Image Translation, Nexar, 2018.

One-Sided Unsupervised Domain Mapping, Hebrew University Vision Seminar, Weizmann Institute Vision Seminar, Technion Pixel Club, 2018.

Teaching
Convolutional Neural Networks. Tel Aviv University. Spring 2019, Spring 2020, Spring 2021.

Awards
Awarded the Raymond and Beverly Sackler Excellence Scholarship for the Faculty of Exact Sciences. January 2018.

Voluntary Activities
Reviewer for NeurIPS, ICLR, CVPR, ICML, ECCV, ICCV.

This page design is based on a template by Jon Barron.