We introduce the latest progress of Qwen-Audio: Qwen2-Audio, a large-scale audio-language model capable of accepting various audio signal inputs and performing audio analysis or producing direct textual responses to speech instructions.
Recent advances have achieved realistic virtual try-on (VTON) through localized garment inpainting with latent diffusion models, significantly enhancing consumers' online shopping experience.
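As a point of reference only, and not the paper's own pipeline, localized garment inpainting with a latent diffusion model can be sketched with the diffusers inpainting API; the checkpoint id, image paths, and prompt below are placeholders.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Generic latent-diffusion inpainting model (placeholder checkpoint, not the paper's).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

person = Image.open("person.png").convert("RGB").resize((512, 512))        # try-on target
mask = Image.open("garment_mask.png").convert("RGB").resize((512, 512))    # white = region to repaint

# Only the masked garment region is regenerated; the rest of the person image is kept.
result = pipe(
    prompt="a red knitted sweater, photorealistic",
    image=person,
    mask_image=mask,
).images[0]
result.save("tryon.png")
```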
We further propose a multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 were used for training) in a highly efficient autoregressive manner.
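The multimodal variant of the mechanism is not detailed in this excerpt; the sketch below only illustrates the underlying attention-sink idea for a KV cache (retain the first few "sink" positions plus a recent window, drop the rest), with `num_sink` and `window` as illustrative parameter names.

```python
import torch

def evict_kv_cache(keys, values, num_sink=4, window=1024):
    """Keep the first `num_sink` (sink) positions plus the most recent `window`
    positions of a per-layer KV cache; drop everything in between.
    keys/values: [batch, heads, seq_len, head_dim]."""
    seq_len = keys.size(2)
    if seq_len <= num_sink + window:
        return keys, values
    keep = torch.cat([
        torch.arange(num_sink),                    # attention-sink tokens
        torch.arange(seq_len - window, seq_len),   # recent window
    ])
    return keys[:, :, keep], values[:, :, keep]
```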
We study how to apply large language models to write grounded and organized long-form articles from scratch, with breadth and depth comparable to Wikipedia pages.
In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer that is scalable and competitive with dense networks while exhibiting highly optimized inference.
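DiT-MoE's exact routing and expert design are not specified in this excerpt; the following is a minimal, generic sketch of a sparse top-k mixture-of-experts feed-forward layer of the kind used to sparsify Transformer blocks. All hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Generic top-k mixture-of-experts FFN: each token is routed to k experts
    and their outputs are combined with renormalized gate weights."""
    def __init__(self, dim, hidden_dim, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: [tokens, dim]
        scores = self.gate(x).softmax(dim=-1)   # routing probabilities per token
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                   # which tokens route to expert e
            if mask.any():
                rows, slot = mask.nonzero(as_tuple=True)
                out[rows] += weights[rows, slot].unsqueeze(-1) * expert(x[rows])
        return out
```

Only `top_k` experts are evaluated per token, which is what makes such layers cheaper at inference than a dense FFN of the same total parameter count.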
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs).
To handle this issue, we propose the General Computer Control (GCC) setting, which restricts foundation agents to interacting with software through the most unified and standardized interface: screenshots as input and keyboard and mouse actions as output.
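The excerpt specifies only the interface, not the agent. The sketch below, built on the pyautogui library, shows what "screenshots in, keyboard and mouse actions out" looks like in code; `propose_action` is a hypothetical placeholder for the foundation agent's policy.

```python
import pyautogui  # cross-platform screenshot capture + keyboard/mouse control

def propose_action(screenshot):
    """Placeholder policy: in the GCC setting this would be a foundation agent
    that decides an action from the screenshot alone. Hypothetical."""
    return {"type": "click", "x": 100, "y": 200}

def step():
    frame = pyautogui.screenshot()       # observation: a screenshot (PIL image)
    action = propose_action(frame)       # decision made only from pixels
    if action["type"] == "click":
        pyautogui.click(action["x"], action["y"])
    elif action["type"] == "type":
        pyautogui.write(action["text"], interval=0.05)
    elif action["type"] == "press":
        pyautogui.press(action["key"])
```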
The extraordinary ability of generative models to edit and generate realistic images has emerged as a new trend, posing a serious threat to the trustworthiness of multimedia data and driving research on image manipulation detection and localization (IMDL).
Image matching is a core component of all the best-performing algorithms and pipelines in 3D vision.
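For illustration only, a classical image-matching front end of the kind used in 3D-vision pipelines (SIFT keypoints, Lowe's ratio test, RANSAC geometric verification) can be written with OpenCV; the file paths and ratio threshold are placeholders, and this is a baseline rather than any specific top-performing method.

```python
import cv2
import numpy as np

def match_pair(path_a, path_b, ratio=0.75):
    """SIFT keypoints + Lowe ratio test, then RANSAC verification
    via a fundamental-matrix fit; returns matched points and inlier mask."""
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher()
    good = [m for m, n in matcher.knnMatch(des_a, des_b, k=2)
            if m.distance < ratio * n.distance]          # Lowe's ratio test
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])
    F, inliers = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC)
    return pts_a, pts_b, inliers
```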
We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications.
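The exact MambaVision block design is not given in this excerpt; the sketch below shows a generic hybrid block whose token mixer is either a Mamba (SSM) layer or multi-head self-attention, with attention used only in the last blocks of a stage as one common hybrid layout. The mamba_ssm import (which typically requires a CUDA build) and all hyperparameters are assumptions.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm; assumed available

class HybridBlock(nn.Module):
    """Pre-norm block: token mixer (Mamba SSM or self-attention) + MLP.
    A generic hybrid sketch, not the actual MambaVision architecture."""
    def __init__(self, dim, use_attention=False, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.use_attention = use_attention
        self.norm1 = nn.LayerNorm(dim)
        if use_attention:
            self.mixer = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        else:
            self.mixer = Mamba(d_model=dim, d_state=16, d_conv=4, expand=2)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )

    def forward(self, x):                   # x: [batch, tokens, dim]
        h = self.norm1(x)
        if self.use_attention:
            h, _ = self.mixer(h, h, h, need_weights=False)
        else:
            h = self.mixer(h)
        x = x + h
        return x + self.mlp(self.norm2(x))

# Example stage: SSM-style mixing in early blocks, self-attention in the last ones.
stage = nn.Sequential(*[HybridBlock(256, use_attention=(i >= 2)) for i in range(4)])
```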