Extract data from PDF

pdf-gold-digger

PDF information extraction library based on pdf.js and Node.js.

Work in progress

Supports:

  • text extraction
    • text separated by page
    • text separated by line
    • font information reported separately
    • bounding box positions
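
To illustrate what separating lines and computing bounding boxes can involve, here is a minimal sketch. It is not this library's actual implementation. It assumes text items shaped like the ones pdf.js returns from `getTextContent()` (`str`, `transform`, `width`, `height`, `fontName`). The helper name `groupItemsIntoLines` and the `tolerance` parameter are hypothetical.

```javascript
// Hypothetical helper, not part of pdf-gold-digger's API.
// Each item is assumed to look like a pdf.js text content item:
//   { str, transform: [a, b, c, d, x, y], width, height, fontName }
function groupItemsIntoLines(items, tolerance = 2) {
  const lines = [];
  for (const item of items) {
    const x = item.transform[4]; // horizontal translation
    const y = item.transform[5]; // vertical translation (baseline)
    // Reuse an existing line when the baselines are within `tolerance` units.
    let line = lines.find((l) => Math.abs(l.y - y) <= tolerance);
    if (!line) {
      line = { y, items: [] };
      lines.push(line);
    }
    line.items.push({
      str: item.str,
      fontName: item.fontName,
      x,
      y,
      width: item.width,
      height: item.height,
    });
  }
  // Assumes the default PDF coordinate system (origin at bottom-left),
  // so a larger y means higher on the page.
  lines.sort((a, b) => b.y - a.y);
  return lines.map((line) => {
    const sorted = line.items.slice().sort((a, b) => a.x - b.x);
    const left = Math.min(...sorted.map((i) => i.x));
    const right = Math.max(...sorted.map((i) => i.x + i.width));
    const bottom = Math.min(...sorted.map((i) => i.y));
    const top = Math.max(...sorted.map((i) => i.y + i.height));
    return {
      // Joining with a single space is a simplification; real PDFs may
      // already contain spaces inside items or need gap-based spacing.
      text: sorted.map((i) => i.str).join(' '),
      fonts: [...new Set(sorted.map((i) => i.fontName))],
      bbox: { x: left, y: bottom, width: right - left, height: top - bottom },
    };
  });
}
```

Calling `groupItemsIntoLines(textContent.items)` on the items of one page would yield one object per line, each with its text, the font names used, and a bounding box. Rotated text and multi-column layouts would need more than the single baseline comparison used here.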

TODO:

  • specify output format and output directory
  • output to XML format
  • output to JSON format
  • extract images to files
  • extract fonts
  • extract tables
  • advanced font information
  • extract forms
  • extract drawings