How to extract page number from Amazon Textract


Hello, AWS people. I am extracting the content of a PDF with Amazon Textract. As far as I know, Amazon Textract doesn't support page numbers, but I need the page numbers to come back together with the extracted results. What can I do?

mzhyo
asked 14 days ago, 126 views
2 Answers
0

Hi,

Textract returns the result of its analysis as a list of blocks of various types. One of those types represents a page: see https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html

You can locate the page blocks by filtering on the attribute BlockType = "PAGE".

You can also get the total number of pages from the attribute DocumentMetadata, which contains the sub-attribute Pages.

Finally, if you are using the Analyze Lending API, you get the page number in PageClassification under PageNumber: see https://docs.aws.amazon.com/textract/latest/dg/API_PageClassification.html
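To make the points above concrete, here is a minimal sketch in Python. It assumes you already have the parsed JSON response as a dict, and it relies on the Page field that the Block API reference documents on each block (it may be absent for single-page input, so the code defaults to 1). The sample response is hand-written for illustration and is not real Textract output; the function name lines_by_page is hypothetical.

```python
from collections import defaultdict


def lines_by_page(response):
    """Group the text of LINE blocks by the page they were detected on."""
    pages = defaultdict(list)
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            # "Page" may be missing for single-page input, so default to 1.
            pages[block.get("Page", 1)].append(block.get("Text", ""))
    return dict(pages)


# Hypothetical, hand-written response fragment (illustration only).
sample_response = {
    "DocumentMetadata": {"Pages": 2},
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1", "Page": 1},
        {"BlockType": "LINE", "Id": "l1", "Page": 1, "Text": "Introduction"},
        {"BlockType": "PAGE", "Id": "p2", "Page": 2},
        {"BlockType": "LINE", "Id": "l2", "Page": 2, "Text": "Conclusion"},
    ],
}

# Total page count comes from DocumentMetadata.Pages.
total_pages = sample_response["DocumentMetadata"]["Pages"]
grouped = lines_by_page(sample_response)
print(total_pages, grouped)
```

With a real multi-page PDF you would typically pass in the combined blocks from the asynchronous API instead of the sample dict.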

Best,

Didier

AWS
EXPERT
answered 14 days ago
EXPERT
reviewed 13 days ago
0

A couple of important suggestions here:

  1. If it's your first time using Amazon Textract and you're able to work in Python or JS/TS, I'd suggest using the open-source helper libraries amazon-textract-textractor (Py) or amazon-textract-response-parser (JS) which can greatly simplify your code to navigate the content returned by the Textract API.
    • Especially if your goal is to extract the document into formats like HTML or Markdown, as these libraries already have tools for this.
  2. If you're specifically trying to detect page numbers written on the document itself, check out the Layout analysis feature, which costs extra when enabled but can detect regions like LAYOUT_FOOTER and LAYOUT_PAGE_NUMBER.
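For the second suggestion, a rough sketch of reading the printed page numbers straight from the raw response could look like the following. It assumes the document was analyzed with the LAYOUT feature enabled, and that each LAYOUT_PAGE_NUMBER block points to its LINE blocks through a CHILD relationship; please check both against a real response. The function name printed_page_numbers and the sample data are hypothetical.

```python
def printed_page_numbers(response):
    """Map each page index to the page-number text printed on that page."""
    blocks = response.get("Blocks", [])
    by_id = {b["Id"]: b for b in blocks}
    found = {}
    for block in blocks:
        if block.get("BlockType") != "LAYOUT_PAGE_NUMBER":
            continue
        # Collect the ids of the child blocks (assumed to be LINE blocks).
        child_ids = [
            cid
            for rel in block.get("Relationships", [])
            if rel.get("Type") == "CHILD"
            for cid in rel.get("Ids", [])
        ]
        text = " ".join(by_id[c].get("Text", "") for c in child_ids if c in by_id)
        found[block.get("Page", 1)] = text
    return found


# Hypothetical, hand-written response fragment (illustration only).
sample_layout_response = {
    "Blocks": [
        {"BlockType": "LINE", "Id": "line-1", "Page": 1, "Text": "Page 1 of 2"},
        {
            "BlockType": "LAYOUT_PAGE_NUMBER",
            "Id": "layout-1",
            "Page": 1,
            "Relationships": [{"Type": "CHILD", "Ids": ["line-1"]}],
        },
        {"BlockType": "LINE", "Id": "line-2", "Page": 2, "Text": "2"},
        {
            "BlockType": "LAYOUT_PAGE_NUMBER",
            "Id": "layout-2",
            "Page": 2,
            "Relationships": [{"Type": "CHILD", "Ids": ["line-2"]}],
        },
    ]
}

printed = printed_page_numbers(sample_layout_response)
print(printed)
```

The helper libraries mentioned in the first suggestion wrap this kind of traversal, so hand-written code like this is mainly useful if you want to stay on the raw API response.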

As the other answer mentioned, you can also determine the number of pages in the document from the returned DocumentMetadata, and use the order of the parent PAGE blocks to work out the page index of a particular piece of content.
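As a sketch of that last idea, the snippet below walks the PAGE blocks in the order they appear and records which page each child block belongs to. It assumes PAGE blocks are returned in document order and list their content through CHILD relationships, which is worth verifying on your own output. The function name page_index_by_block_id and the sample data are hypothetical.

```python
def page_index_by_block_id(response):
    """Map each child block id to the 1-based index of its parent PAGE block."""
    index = {}
    page_no = 0
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "PAGE":
            continue
        page_no += 1  # Assumes PAGE blocks arrive in document order.
        for rel in block.get("Relationships", []):
            if rel.get("Type") == "CHILD":
                for child_id in rel.get("Ids", []):
                    index[child_id] = page_no
    return index


# Hypothetical, hand-written response fragment (illustration only).
sample_paged_response = {
    "Blocks": [
        {
            "BlockType": "PAGE",
            "Id": "page-1",
            "Relationships": [{"Type": "CHILD", "Ids": ["line-a", "line-b"]}],
        },
        {"BlockType": "LINE", "Id": "line-a", "Text": "First page, line one"},
        {"BlockType": "LINE", "Id": "line-b", "Text": "First page, line two"},
        {
            "BlockType": "PAGE",
            "Id": "page-2",
            "Relationships": [{"Type": "CHILD", "Ids": ["line-c"]}],
        },
        {"BlockType": "LINE", "Id": "line-c", "Text": "Second page"},
    ]
}

page_of = page_index_by_block_id(sample_paged_response)
print(page_of)
```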

AWS
EXPERT
Alex_T
answered 13 days ago