Artwork

Konten disediakan oleh Rob. Semua konten podcast termasuk episode, grafik, dan deskripsi podcast diunggah dan disediakan langsung oleh Rob atau mitra platform podcast mereka. Jika Anda yakin seseorang menggunakan karya berhak cipta Anda tanpa izin, Anda dapat mengikuti proses yang diuraikan di sini https://id.player.fm/legal.
Player FM - Aplikasi Podcast
Offline dengan aplikasi Player FM !

PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling

33:45
 
Bagikan
 

Seri yang sudah diarsipkan ("Feed tidak aktif" status)

When? This feed was archived on August 11, 2025 06:07 (4M ago). Last successful fetch was on November 01, 2024 13:33 (1y ago)

Why? Feed tidak aktif status. Server kami tidak mendapatkan feed podcast yang valid secara terus-menerus.

What now? You might be able to find a more up-to-date version using the search function. This series will no longer be checked for updates. If you believe this to be in error, please check if the publisher's feed link below is valid and contact support to request the feed be restored or if you have any other concerns about this.

Manage episode 445136835 series 2954468
Konten disediakan oleh Rob. Semua konten podcast termasuk episode, grafik, dan deskripsi podcast diunggah dan disediakan langsung oleh Rob atau mitra platform podcast mereka. Jika Anda yakin seseorang menggunakan karya berhak cipta Anda tanpa izin, Anda dapat mengikuti proses yang diuraikan di sini https://id.player.fm/legal.
Document understanding is a challenging task to process and comprehend large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task. However, existing methods typically focus on either plain text or a limited number of document images, struggling to handle long PDF documents with interleaved text and images, especially in academic papers. In this paper, we introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents. PDF-WuKong incorporates a sparse sampler that operates on both text and image representations, significantly improving the efficiency and capability of the MLLM. The sparse sampler is integrated with the MLLM's image encoder and selects the paragraphs or diagrams most pertinent to user queries for processing by the language model. To effectively train and evaluate our model, we construct PaperPDF, a dataset consisting of a broad collection of academic papers sourced from arXiv, multiple strategies are proposed to generate automatically 1M QA pairs along with their corresponding evidence sources. Experimental results demonstrate the superiority and high efficiency of our approach over other models on the task of long multimodal PDF understanding, surpassing proprietary products by an average of 8.6% on F1. Our code and dataset will be released at https://github.com/yh-hust/PDF-Wukong.
2024: Xudong Xie, Liang Yin, Hao Yan, Yang Liu, Jing Ding, Minghui Liao, Yuliang Liu, Wei Chen, Xiang Bai
https://arxiv.org/pdf/2410.05970v1
  continue reading

298 episode

Artwork
iconBagikan
 

Seri yang sudah diarsipkan ("Feed tidak aktif" status)

When? This feed was archived on August 11, 2025 06:07 (4M ago). Last successful fetch was on November 01, 2024 13:33 (1y ago)

Why? Feed tidak aktif status. Server kami tidak mendapatkan feed podcast yang valid secara terus-menerus.

What now? You might be able to find a more up-to-date version using the search function. This series will no longer be checked for updates. If you believe this to be in error, please check if the publisher's feed link below is valid and contact support to request the feed be restored or if you have any other concerns about this.

Manage episode 445136835 series 2954468
Konten disediakan oleh Rob. Semua konten podcast termasuk episode, grafik, dan deskripsi podcast diunggah dan disediakan langsung oleh Rob atau mitra platform podcast mereka. Jika Anda yakin seseorang menggunakan karya berhak cipta Anda tanpa izin, Anda dapat mengikuti proses yang diuraikan di sini https://id.player.fm/legal.
Document understanding is a challenging task to process and comprehend large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task. However, existing methods typically focus on either plain text or a limited number of document images, struggling to handle long PDF documents with interleaved text and images, especially in academic papers. In this paper, we introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents. PDF-WuKong incorporates a sparse sampler that operates on both text and image representations, significantly improving the efficiency and capability of the MLLM. The sparse sampler is integrated with the MLLM's image encoder and selects the paragraphs or diagrams most pertinent to user queries for processing by the language model. To effectively train and evaluate our model, we construct PaperPDF, a dataset consisting of a broad collection of academic papers sourced from arXiv, multiple strategies are proposed to generate automatically 1M QA pairs along with their corresponding evidence sources. Experimental results demonstrate the superiority and high efficiency of our approach over other models on the task of long multimodal PDF understanding, surpassing proprietary products by an average of 8.6% on F1. Our code and dataset will be released at https://github.com/yh-hust/PDF-Wukong.
2024: Xudong Xie, Liang Yin, Hao Yan, Yang Liu, Jing Ding, Minghui Liao, Yuliang Liu, Wei Chen, Xiang Bai
https://arxiv.org/pdf/2410.05970v1
  continue reading

298 episode

Semua episode

×
 
Loading …

Selamat datang di Player FM!

Player FM memindai web untuk mencari podcast berkualitas tinggi untuk Anda nikmati saat ini. Ini adalah aplikasi podcast terbaik dan bekerja untuk Android, iPhone, dan web. Daftar untuk menyinkronkan langganan di seluruh perangkat.

 

Panduan Referensi Cepat

Dengarkan acara ini sambil menjelajah
Putar