
NYU’s new AI architecture makes generating high-quality images faster and less expensive

Researchers at New York University have developed a new architecture for diffusion models that improves the semantic representation of the images they generate. “Diffusion Transformer with Representation Autoencoders” (RAE) challenges some of the accepted norms for building diffusion models. The NYU researchers’ model is more efficient and accurate than standard diffusion models, takes advantage of the latest research in representation learning, and could pave the way for new applications that were previously too difficult or expensive.

This advancement could unlock more reliable and powerful features for enterprise applications. "To edit images well, a model must truly understand what’s in them," Saining Xie, co-author of the paper, told VentureBeat. "RAE helps to connect this understanding part with the generation part." He also discussed future applications in "RAG-based generation, where you use the RAE encoder features for searching and then generate new images based on the search results," as well as in "video generation and action-conditioned world models."

The state of generative modeling

Diffusion models, the technology behind most of today’s powerful image generators, generate images through a process of learning to compress and decompress them. A variational autoencoder (VAE) learns a compact representation of an image’s key features in what is called “latent space.” The model is then trained to generate new images by reversing this process, starting from random noise.
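For readers who want a concrete picture of that pipeline, here is a minimal sketch of latent-diffusion training. The module names (ToyVAE, ToyDenoiser) and all sizes are illustrative stand-ins, not the architecture of any particular system.

```python
# Minimal sketch of latent-diffusion training: encode images into a latent space,
# add noise, and train a denoiser to predict that noise. Illustrative only.
import torch
import torch.nn as nn

class ToyVAE(nn.Module):
    """Compresses 3x64x64 images into a small latent and back (toy stand-in)."""
    def __init__(self, latent_dim=4):
        super().__init__()
        self.encoder = nn.Conv2d(3, latent_dim, kernel_size=8, stride=8)        # 64x64 -> 8x8
        self.decoder = nn.ConvTranspose2d(latent_dim, 3, kernel_size=8, stride=8)

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)

class ToyDenoiser(nn.Module):
    """Predicts the noise that was added to a latent (toy stand-in for a DiT)."""
    def __init__(self, latent_dim=4):
        super().__init__()
        self.net = nn.Conv2d(latent_dim, latent_dim, kernel_size=3, padding=1)

    def forward(self, z_noisy, t):
        return self.net(z_noisy)  # a real model would also condition on the timestep t

vae, denoiser = ToyVAE(), ToyDenoiser()
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

images = torch.randn(8, 3, 64, 64)                 # dummy batch
with torch.no_grad():
    z = vae.encode(images)                          # compress to latent space
t = torch.rand(z.size(0))                           # random diffusion timesteps in [0, 1)
noise = torch.randn_like(z)
z_noisy = (1 - t.view(-1, 1, 1, 1)) * z + t.view(-1, 1, 1, 1) * noise  # interpolate toward noise

loss = nn.functional.mse_loss(denoiser(z_noisy, t), noise)  # learn to predict the added noise
loss.backward()
opt.step()
# At sampling time, denoised latents are passed through vae.decode() to produce pixels.
```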

Although the diffusion part of these models has progressed, the autoencoder used in most of them has remained largely unchanged in recent years. According to the NYU researchers, this standard autoencoder (SD-VAE) is suitable for capturing low-level, local appearance features, but it lacks “the global semantic structure crucial for generalization and generative performance.”

At the same time, the field has seen impressive progress in learning image representation with models such as DINO, MAE and CLIP. These models learn semantically structured visual features that generalize across tasks and can serve as a natural basis for visual understanding. However, a widely held belief has prevented developers from using these architectures in image generation: semantic-driven models are not suitable for image generation because they do not capture granular features at the pixel level. Practitioners also believe that diffusion models do not work well with the type of high-dimensional representations produced by semantic models.

Diffusion with representation autoencoders

The NYU researchers propose replacing the standard VAE with a “representation autoencoder” (RAE). This new type of autoencoder combines a pre-trained representation encoder, such as Meta’s DINO, with a trained vision transformer decoder. This approach simplifies the training process by using powerful, existing encoders that have already been trained on large datasets.
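The core idea can be sketched in a few lines of PyTorch: a frozen, pre-trained encoder provides the latent space, and only a lightweight transformer decoder is trained to map those features back to pixels. The classes below (FrozenEncoder, PixelDecoder) are hypothetical stand-ins rather than the paper’s code; in practice the encoder would be a pre-trained DINO-style ViT.

```python
# Sketch of the RAE idea: frozen semantic encoder + trainable pixel decoder.
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Stand-in for a pre-trained semantic encoder (e.g. a DINO-style ViT)."""
    def __init__(self, dim=768):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # ViT-style patchify
        for p in self.parameters():
            p.requires_grad = False                  # encoder weights stay frozen

    def forward(self, x):                            # x: [B, 3, 224, 224]
        tokens = self.patch_embed(x)                 # [B, dim, 14, 14]
        return tokens.flatten(2).transpose(1, 2)     # [B, 196, dim] semantic tokens

class PixelDecoder(nn.Module):
    """Trainable ViT-style decoder mapping semantic tokens back to pixels."""
    def __init__(self, dim=768):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.to_pixels = nn.Linear(dim, 3 * 16 * 16)  # one 16x16 RGB patch per token

    def forward(self, tokens):                        # tokens: [B, 196, dim]
        x = self.to_pixels(self.blocks(tokens))       # [B, 196, 768]
        B = x.size(0)
        x = x.view(B, 14, 14, 3, 16, 16).permute(0, 3, 1, 4, 2, 5)
        return x.reshape(B, 3, 224, 224)

encoder, decoder = FrozenEncoder(), PixelDecoder()
opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)   # only the decoder is trained

images = torch.randn(2, 3, 224, 224)                    # dummy batch
with torch.no_grad():
    tokens = encoder(images)                            # semantic latent space
recon = decoder(tokens)
loss = nn.functional.mse_loss(recon, images)            # pixel reconstruction objective
loss.backward()
opt.step()
```

Because only the decoder receives gradients, the semantic structure already learned by the encoder is left intact while the decoder learns to translate it back into pixels.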

To make this work, the team developed a variation of the diffusion transformer (DiT), the backbone of most image generation models. This modified DiT can be trained efficiently in the high-dimensional space of RAEs without incurring huge computational costs. The researchers show that fixed representation encoders, even those optimized for semantics, can be adapted to image generation tasks. Their method yields reconstructions superior to the standard SD-VAE’s without adding architectural complexity.
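A rough sketch of what that looks like in practice: a transformer denoiser that operates directly on the high-dimensional token latents and is conditioned on the diffusion timestep. The TokenDenoiser class and its sizes are illustrative assumptions, not the paper’s DiT variant.

```python
# Sketch of a DiT-style denoiser operating on high-dimensional token latents.
import torch
import torch.nn as nn

class TokenDenoiser(nn.Module):
    """Transformer that predicts noise on a sequence of latent tokens, conditioned on t."""
    def __init__(self, dim=768, depth=2):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, tokens, t):                            # tokens: [B, N, dim], t: [B]
        cond = self.time_embed(t.view(-1, 1)).unsqueeze(1)   # [B, 1, dim] timestep embedding
        return self.out(self.blocks(tokens + cond))          # noise prediction per token

model = TokenDenoiser()
tokens = torch.randn(4, 196, 768)                            # latents from a frozen encoder
t = torch.rand(4)
noise = torch.randn_like(tokens)
noisy = (1 - t.view(-1, 1, 1)) * tokens + t.view(-1, 1, 1) * noise
loss = nn.functional.mse_loss(model(noisy, t), noise)        # same denoising objective, richer latents
loss.backward()
```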

However, adopting this approach requires a change in mindset. "RAE is not a simple plug-and-play autoencoder; the diffusion modeling part must also evolve," Xie explained. "A key point we want to emphasize is that latent space modeling and generative modeling should be designed together rather than treated separately."

With the right architectural adjustments, the researchers found that higher-dimensional representations are an advantage, providing richer structure, faster convergence, and better generation quality. In their paper, the researchers note that these "higher-dimensional latents effectively introduce no additional computational or memory costs." Additionally, the standard SD-VAE is more computationally expensive than RAE, requiring approximately six times more compute for the encoder and three times more for the decoder.

Increased performance and efficiency

The new model architecture provides significant gains in training efficiency and generation quality. The team’s improved diffusion recipe delivers strong results after just 80 training epochs. Compared to previous diffusion models trained on VAEs, the RAE-based model achieves a 47x training speedup. It also outperforms recent methods based on representation alignment, with a 16x training speedup. This level of efficiency directly translates into lower training costs and faster model development cycles.

For enterprise use, this translates to more reliable and consistent results. Xie noted that RAE-based models are less prone to the semantic errors seen in classical diffusion models, adding that RAE gives the model "a much smarter view of data." He observed that flagship models like ChatGPT-4o and Google’s Nano Banana are moving towards "topic-driven, highly coherent, knowledge-augmented generation," and that the semantically rich basis of RAE is essential to achieve this reliability at scale and in open source models.

The researchers demonstrated this performance on the ImageNet benchmark. Using the Fréchet Inception Distance (FID), where a lower score indicates higher quality images, the RAE-based model achieved a state-of-the-art score of 1.51 without guidance. With AutoGuidance, a technique that uses a smaller model to steer the generation process, the FID score dropped to an even more impressive 1.13 for both 256×256 and 512×512 images.
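For context, FID compares the mean and covariance of feature statistics from real and generated images. A minimal computation of the standard formula, assuming the features have already been extracted (in practice from an Inception-v3 network; random arrays are used here just to show the calculation), looks like this:

```python
# FID = ||mu_r - mu_g||^2 + Tr(Cov_r + Cov_g - 2 * (Cov_r @ Cov_g)^(1/2))
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Compute FID from two feature matrices (rows = images, columns = features)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))

# Small value for two samples drawn from the same distribution, larger as they diverge.
print(fid(np.random.randn(500, 64), np.random.randn(500, 64)))
```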

By successfully integrating modern representation learning into the diffusion framework, this work opens a new avenue for building more efficient and cost-effective generative models. This unification portends a future of more integrated AI systems.

"We believe that in the future there will be a single, unified representation model that captures the rich, underlying structure of reality…capable of decoding it into many different output modalities," Xie said. He added that RAE offers a unique path to this goal: "The high-dimensional latent space must be learned separately to provide a strong prior that can then be decoded in various modalities – rather than relying on a brute force approach of shuffling all the data and training multiple objectives at once."
