Hi,
Textract returns the results of its analysis as a list of blocks. Some of them are of type PAGE: see https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html
You can locate them by filtering on the attribute BlockType = "PAGE".
You can also get the total number of pages from the attribute DocumentMetadata, which contains the sub-attribute Pages.
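As a rough illustration of the two lookups above, here is a minimal Python sketch. It assumes you already have the Textract response as a plain dict (for example, as returned by boto3); the function name summarize_pages and the sample data are made up for this example and are not part of any AWS library.

```python
from typing import Any, Dict, List, Tuple


def summarize_pages(response: Dict[str, Any]) -> Tuple[int, List[Dict[str, Any]]]:
    """Return (total page count, list of PAGE blocks) from a Textract response dict."""
    # Total number of pages is reported under DocumentMetadata.Pages.
    total_pages = response.get("DocumentMetadata", {}).get("Pages", 0)
    # PAGE blocks are identified by their BlockType attribute.
    page_blocks = [
        block for block in response.get("Blocks", [])
        if block.get("BlockType") == "PAGE"
    ]
    return total_pages, page_blocks


# Hypothetical, heavily trimmed response used only to exercise the function.
sample_response = {
    "DocumentMetadata": {"Pages": 2},
    "Blocks": [
        {"BlockType": "PAGE", "Id": "page-1", "Page": 1},
        {"BlockType": "LINE", "Id": "line-1", "Page": 1, "Text": "Hello"},
        {"BlockType": "PAGE", "Id": "page-2", "Page": 2},
    ],
}

total, pages = summarize_pages(sample_response)
print(total, [p["Id"] for p in pages])
```

For asynchronous jobs the blocks may arrive across several paginated responses, so you would likely need to collect all of them before counting PAGE blocks.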
Finally, you get the page number in PageClassification under PageNumber: see https://docs.aws.amazon.com/textract/latest/dg/API_PageClassification.html
Best,
Didier
A couple of important suggestions here:
- If it's your first time using Amazon Textract and you're able to work in Python or JS/TS, I'd suggest using the open-source helper libraries amazon-textract-textractor (Py) or amazon-textract-response-parser (JS) which can greatly simplify your code to navigate the content returned by the Textract API.
- These libraries are especially useful if your goal is to extract the document into formats like HTML or Markdown, as they already have tools for this.
- If you're specifically trying to detect page numbers written on the document, check out the Layout analysis feature, which costs extra when enabled but can detect regions like LAYOUT_FOOTER and LAYOUT_PAGE_NUMBER.
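If you prefer to work with the raw response rather than a helper library, the following is a minimal Python sketch of reading printed page numbers from Layout results. It assumes the Layout feature was enabled, that the blocks are available as a list of dicts, and that each LAYOUT_PAGE_NUMBER block points to its LINE blocks through a CHILD relationship; the function name printed_page_numbers and the sample data are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def printed_page_numbers(blocks: List[Dict[str, Any]]) -> List[Tuple[int, str]]:
    """Return (page, text) pairs for every LAYOUT_PAGE_NUMBER block."""
    by_id = {block["Id"]: block for block in blocks}
    results = []
    for block in blocks:
        if block.get("BlockType") != "LAYOUT_PAGE_NUMBER":
            continue
        # Collect the Ids referenced by CHILD relationships (assumed to be LINE blocks).
        child_ids = [
            child_id
            for rel in block.get("Relationships", [])
            if rel.get("Type") == "CHILD"
            for child_id in rel.get("Ids", [])
        ]
        text = " ".join(
            by_id[cid].get("Text", "")
            for cid in child_ids
            if cid in by_id and by_id[cid].get("BlockType") == "LINE"
        )
        # Page may be absent on single-page results, so default to 1.
        results.append((block.get("Page", 1), text))
    return results


# Hypothetical, trimmed blocks used only to exercise the function.
sample_blocks = [
    {"BlockType": "PAGE", "Id": "p1", "Page": 1},
    {"BlockType": "LINE", "Id": "l1", "Page": 1, "Text": "Page 1 of 2"},
    {
        "BlockType": "LAYOUT_PAGE_NUMBER",
        "Id": "lpn1",
        "Page": 1,
        "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}],
    },
]

numbers = printed_page_numbers(sample_blocks)
print(numbers)
```

The text is returned as a raw string because printed page numbers can take many forms (roman numerals, "Page 1 of 2", and so on), so any parsing is left to the caller.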
As the other answer mentioned, you can also determine the number of pages in the document from the returned DocumentMetadata, and refer to the sequence of parent PAGE blocks to find the current page index of a particular piece of content.
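One way to read "parent PAGE block sequencing" is sketched below in Python: walk the PAGE blocks in the order they are returned and assign a 1-based index to each of their direct children. This assumes the blocks are a list of dicts in response order; the function name page_index_by_block_id and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def page_index_by_block_id(blocks: List[Dict[str, Any]]) -> Dict[str, int]:
    """Map each direct child of a PAGE block to that page's 1-based position."""
    index: Dict[str, int] = {}
    page_no = 0
    for block in blocks:
        if block.get("BlockType") != "PAGE":
            continue
        page_no += 1  # position of this PAGE block in the sequence
        for rel in block.get("Relationships", []):
            if rel.get("Type") != "CHILD":
                continue
            for child_id in rel.get("Ids", []):
                index[child_id] = page_no
    return index


# Hypothetical, trimmed blocks used only to exercise the function.
sample_page_blocks = [
    {"BlockType": "PAGE", "Id": "p1",
     "Relationships": [{"Type": "CHILD", "Ids": ["l1", "l2"]}]},
    {"BlockType": "LINE", "Id": "l1", "Text": "First page, line one"},
    {"BlockType": "LINE", "Id": "l2", "Text": "First page, line two"},
    {"BlockType": "PAGE", "Id": "p2",
     "Relationships": [{"Type": "CHILD", "Ids": ["l3"]}]},
    {"BlockType": "LINE", "Id": "l3", "Text": "Second page"},
]

page_of = page_index_by_block_id(sample_page_blocks)
print(page_of)
```

This only covers direct children of each PAGE block. Nested content such as the WORD blocks under a LINE would need one more hop through the LINE's own CHILD relationships.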