How to extract page number from Amazon Textract

Hello, AWS people. I am extracting the content of a PDF with Amazon Textract. As far as I know, Amazon Textract doesn't support page numbers, but I need to get the page number along with each piece of extracted content. What can I do?


mzhyo

asked 14 days ago, 126 views

2 Answers


Hi,

Textract returns the result of its analysis as a list of blocks of various types. Some of them are of type PAGE: see https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html

You can locate them by filtering on the attribute BlockType = "PAGE".

You can also get the total number of pages from the attribute DocumentMetadata, which contains the sub-attribute Pages.

Finally, if you use the Analyze Lending API, you get the page number in PageClassification under PageNumber: see https://docs.aws.amazon.com/textract/latest/dg/API_PageClassification.html
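To make the points above concrete, here is a minimal sketch of reading page information out of a Textract response. It assumes the response is the plain dictionary returned by the SDK, and it relies on the Page field that Textract sets on each block, which is not mentioned above. The sample response is hand-written for illustration (it is not real Textract output), and the helper name lines_by_page is made up for this example.

```python
from collections import defaultdict


def lines_by_page(response):
    """Group the text of LINE blocks by the page each block was detected on."""
    pages = defaultdict(list)
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            # Fall back to page 1 if the Page field is missing
            # (a defensive assumption for single-image input).
            pages[block.get("Page", 1)].append(block.get("Text", ""))
    return dict(pages)


# Hand-written sample shaped like a Textract response (illustrative only).
sample_response = {
    "DocumentMetadata": {"Pages": 2},
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1", "Page": 1},
        {"BlockType": "LINE", "Id": "l1", "Page": 1, "Text": "Introduction"},
        {"BlockType": "PAGE", "Id": "p2", "Page": 2},
        {"BlockType": "LINE", "Id": "l2", "Page": 2, "Text": "Chapter One"},
        {"BlockType": "LINE", "Id": "l3", "Page": 2, "Text": "It begins."},
    ],
}

# Total page count as reported in DocumentMetadata.
total_pages = sample_response["DocumentMetadata"]["Pages"]

# Number of PAGE blocks found by filtering on BlockType.
page_block_count = sum(
    1 for b in sample_response["Blocks"] if b["BlockType"] == "PAGE"
)

grouped = lines_by_page(sample_response)
for page_number in sorted(grouped):
    print(f"Page {page_number} of {total_pages}: {grouped[page_number]}")
```

For multipage PDFs processed asynchronously, the results arrive in several paginated responses, so you would run the same grouping over the blocks of every response before reading the totals.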

Best,

Didier

EXPERT

Didier_Durand

answered 14 days ago

EXPERT

Pandurangaswamy

reviewed 13 days ago

A couple of important suggestions here:

- If it's your first time using Amazon Textract and you're able to work in Python or JS/TS, I'd suggest using the open-source helper libraries amazon-textract-textractor (Python) or amazon-textract-response-parser (JS), which can greatly simplify the code needed to navigate the content returned by the Textract API.
  - This applies especially if your goal is to extract the document into formats like HTML or Markdown, as these libraries already have tools for this.
- If you're specifically trying to detect page numbers written on the document, check out the Layout analysis feature, which costs extra when enabled but can detect regions like LAYOUT_FOOTER and LAYOUT_PAGE_NUMBER.

As the other answer mentioned, you can also determine the number of pages in the document from the returned DocumentMetadata, and use the order of the parent PAGE blocks to find the page index of a particular piece of content.
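As a rough sketch of the Layout approach, the snippet below collects the printed page number found on each page. It assumes the response came from a call with the Layout feature enabled, that each LAYOUT_PAGE_NUMBER block links to its LINE blocks through a CHILD relationship, and that every block carries a Page field. The sample response is hand-written for illustration, and the helper name printed_page_numbers is made up for this example.

```python
def printed_page_numbers(response):
    """Map each page index to the text of its LAYOUT_PAGE_NUMBER regions."""
    blocks = response.get("Blocks", [])
    blocks_by_id = {b["Id"]: b for b in blocks}
    result = {}
    for block in blocks:
        if block.get("BlockType") != "LAYOUT_PAGE_NUMBER":
            continue
        texts = []
        # Follow CHILD relationships to the LINE blocks holding the text.
        for rel in block.get("Relationships", []):
            if rel.get("Type") != "CHILD":
                continue
            for child_id in rel.get("Ids", []):
                child = blocks_by_id.get(child_id, {})
                if child.get("BlockType") == "LINE":
                    texts.append(child.get("Text", ""))
        # Page is the index within the file; the text is what is printed.
        result.setdefault(block.get("Page", 1), []).append(" ".join(texts))
    return result


# Hand-written sample shaped like a Layout-enabled response (illustrative only).
sample_layout_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1", "Page": 1},
        {"BlockType": "LINE", "Id": "l1", "Page": 1, "Text": "iv"},
        {"BlockType": "LAYOUT_PAGE_NUMBER", "Id": "n1", "Page": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"BlockType": "PAGE", "Id": "p2", "Page": 2},
        {"BlockType": "LINE", "Id": "l2", "Page": 2, "Text": "Page 1"},
        {"BlockType": "LAYOUT_PAGE_NUMBER", "Id": "n2", "Page": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["l2"]}]},
    ],
}

numbers = printed_page_numbers(sample_layout_response)
print(numbers)
```

Keeping the page index and the printed text separate is deliberate: the number printed on a page (for example roman numerals in front matter) often differs from the page's position in the file.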

EXPERT

Alex_T

answered 13 days ago
