Categories
Code

Extract Text from PDF files with Google Apps Script

You can use Google Apps Script to extract text from a PDF file and save the extracted text as a new document in Google Drive. The document will also retain the simple formatting of the PDF file.

The following script illustrates how to use the Google Drive API as an OCR engine and extract text from a PDF file on the Internet. The code can be modified to convert PDF files existing in Google Drive to an editable document.

function extractTextFromPDF() {
  
  // PDF File URL 
  // You can also pull PDFs from Google Drive
  var url = "https://img.labnol.org/files/Most-Useful-Websites.pdf";  
  
  var blob = UrlFetchApp.fetch(url).getBlob();
  var resource = {
    title: blob.getName(),
    mimeType: blob.getContentType()
  };
  
  // Enable the Advanced Drive API Service
  var file = Drive.Files.insert(resource, blob, {ocr: true, ocrLanguage: "en"});
  
  // Extract Text from PDF file
  var doc = DocumentApp.openById(file.id);
  var text = doc.getBody().getText();
  
  return text;
}

The Google Drive API can perform OCR on JPG, PNG, GIF and PDF files. Set the ocrLanguage property to tell the OCR engine which language the document is written in.
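
As a rough sketch of the point above, a small helper can decide whether a file qualifies for OCR and build the options object passed to Drive.Files.insert. The names OCR_MIME_TYPES and buildOcrOptions are hypothetical and not part of the Drive API; the list of types simply mirrors the formats named in this post.

```javascript
// Hypothetical list: MIME types for the formats this post says Drive can OCR
var OCR_MIME_TYPES = ["image/jpeg", "image/png", "image/gif", "application/pdf"];

// Hypothetical helper: returns the options for Drive.Files.insert,
// or null when the MIME type is not in the list above
function buildOcrOptions(mimeType, language) {
  if (OCR_MIME_TYPES.indexOf(mimeType) === -1) {
    return null;
  }
  // Fall back to English when no language code is given
  return { ocr: true, ocrLanguage: language || "en" };
}
```

Inside Apps Script, the result could be passed as the third argument, for example Drive.Files.insert(resource, blob, buildOcrOptions(blob.getContentType(), "fr")).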

Combine this with the doGet method and you have an HTTP REST API that can perform OCR on any web document with a simple GET request. The same approach can be adapted to work with file upload forms.
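
A minimal sketch of such a web app follows. It assumes that extractTextFromPDF has been changed to accept the PDF URL as an argument (the version above hard-codes it), and that the caller passes the address in a query parameter named url. Both the parameter name and the helper getPdfUrlFromRequest are illustrative choices.

```javascript
// Hypothetical helper: returns the "url" query parameter when it looks like
// an http(s) address, otherwise null
function getPdfUrlFromRequest(e) {
  var url = e && e.parameter ? e.parameter.url : "";
  return /^https?:\/\//i.test(url || "") ? url : null;
}

// Web app entry point, called by Apps Script for every GET request
function doGet(e) {
  var url = getPdfUrlFromRequest(e);
  // Assumption: extractTextFromPDF(url) is a variant of the function above
  // that takes the PDF URL as its argument
  var text = url ? extractTextFromPDF(url) : "Missing or invalid url parameter";
  return ContentService.createTextOutput(text).setMimeType(ContentService.MimeType.TEXT);
}
```

Once the script is deployed as a web app, a request to the deployment URL with ?url= followed by the address of a PDF would return the recognized text as plain text.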

Convert and Email Google Spreadsheets as PDF Files

You can set up a cron job in Google Drive using Google Apps Script that will send any Google Spreadsheet, or any other document or file in the Drive, to one or more email addresses at a specified time. You can set the time-based trigger to run weekly, daily, every hour and other recurring schedules.

This example shows how to send a Google Spreadsheet to the specified email address automatically as a PDF file daily. The Google Script converts the Google Spreadsheet into a PDF file and sends it to another email address using your own Gmail account. You can further customize the PDF output, for example by removing gridlines, hiding frozen rows or switching to landscape mode, by setting the appropriate export parameters.
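
The daily schedule can also be created from code rather than from the script editor. The sketch below assumes the sending function is named emailSpreadsheetAsPDF, as in the script later in this post; the helper name createDailyEmailTrigger and the 9 AM hour are arbitrary choices for illustration.

```javascript
// Name of the function the trigger should run (defined later in this post)
var HANDLER = "emailSpreadsheetAsPDF";

// Hypothetical helper: run it once to schedule the daily email
function createDailyEmailTrigger() {
  // Remove earlier triggers for the same handler so the email is not sent twice
  ScriptApp.getProjectTriggers().forEach(function (trigger) {
    if (trigger.getHandlerFunction() === HANDLER) {
      ScriptApp.deleteTrigger(trigger);
    }
  });

  // Run the handler once a day, at around 9 AM in the script's time zone
  return ScriptApp.newTrigger(HANDLER)
    .timeBased()
    .everyDays(1)
    .atHour(9)
    .create();
}
```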

Convert & Email Google Sheets

The Email Google Spreadsheet add-on can automatically convert and email spreadsheets in PDF, CSV or Microsoft Excel (xlsx) formats. It can convert the entire spreadsheet or individual sheets.

The premium version of the add-on can automatically email the converted sheets on a recurring schedule (like every hour, daily, weekly or monthly). You can also set up multiple sending schedules and automatically email the sheet to a different set of recipients at different times.

Google Script to Email Google Spreadsheets

If you cannot use the Email Spreadsheet add-on (some Google Apps admins do not allow add-ons), you can write your own Google Script to email the spreadsheet as PDF files.


/* Send Spreadsheet in an email as PDF, automatically */
function emailSpreadsheetAsPDF() {
  // Send the PDF of the spreadsheet to this email address
  const email = Session.getActiveUser().getEmail() || 'amit@labnol.org';

  // Get the currently active spreadsheet URL (link)
  // Or use SpreadsheetApp.openByUrl("<>");
  const ss = SpreadsheetApp.getActiveSpreadsheet();

  // Subject of email message
  const subject = `PDF generated from spreadsheet ${ss.getName()}`;

  // Email Body can  be HTML too with your logo image - see ctrlq.org/html-mail
  const body = "Sent with Email Google Sheets";

  // Base URL
  const url = 'https://docs.google.com/spreadsheets/d/SS_ID/export?'.replace('SS_ID', ss.getId());

  /* Specify PDF export parameters
    From: https://code.google.com/p/google-apps-script-issues/issues/detail?id=3579
    */

  const exportOptions =
    'exportFormat=pdf&format=pdf' + // export as pdf / csv / xls / xlsx
    '&size=letter' + // paper size legal / letter / A4
    '&portrait=false' + // orientation, false for landscape
    '&fitw=true&source=labnol' + // fit to page width, false for actual size
    '&sheetnames=false&printtitle=false' + // hide optional headers and footers
    '&pagenumbers=false&gridlines=false' + // hide page numbers and gridlines
    '&fzr=false' + // do not repeat row headers (frozen rows) on each page
    '&gid='; // the sheet's Id

  const token = ScriptApp.getOAuthToken();
  const sheets = ss.getSheets();

  // make an empty array to hold your fetched blobs
  const blobs = [];

  for (let i = 0; i < sheets.length; i += 1) {
    // Convert individual worksheets to PDF
    const response = UrlFetchApp.fetch(url + exportOptions + sheets[i].getSheetId(), {
      headers: {
        Authorization: `Bearer ${token}`
      }
    });

    // convert the response to a blob and store in our array
    blobs[i] = response.getBlob().setName(`${sheets[i].getName()}.pdf`);
  }

  // create new blob that is a zip file containing our blob array
  const zipBlob = Utilities.zip(blobs).setName(`${ss.getName()}.zip`);

  // optional: save the file to the root folder of Google Drive
  DriveApp.createFile(zipBlob);

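  // Calling a DriveApp method makes the script request the Drive scope,
  // which the OAuth token needs in order to fetch the export URLs
  Logger.log(`Storage Space used: ${DriveApp.getStorageUsed()}`);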

  // If allowed to send emails, send the email with the PDF attachment
  if (MailApp.getRemainingDailyQuota() > 0)
    GmailApp.sendEmail(email, subject, body, {
      htmlBody: body,
      attachments: [zipBlob]
    });
}

The Google Script function converts each worksheet of a Google Spreadsheet into a PDF file, compresses all the PDF files into a ZIP file and sends it to the designated email address. You can send to multiple email addresses too; just separate them with commas.

The method currently sends all the sheets of a spreadsheet in the ZIP file, but you can also pass a specific sheet ID in the gid parameter to email only that sheet. For instance, the first sheet of a new spreadsheet normally has gid=0.
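
To illustrate, the export URL for a single sheet could be assembled by a small helper like the one below. The function name buildSheetPdfUrl is hypothetical, and the parameters are a subset of the undocumented export options used in the script above.

```javascript
// Hypothetical helper: builds the PDF export URL for one sheet of a spreadsheet
function buildSheetPdfUrl(spreadsheetId, sheetId) {
  var params = {
    format: "pdf",      // export as PDF
    size: "letter",     // paper size
    portrait: "false",  // false for landscape
    fitw: "true",       // fit to page width
    gridlines: "false", // hide gridlines
    gid: String(sheetId) // the sheet to export
  };

  // Join the parameters into a query string, keeping the order above
  var query = Object.keys(params)
    .map(function (key) {
      return key + "=" + encodeURIComponent(params[key]);
    })
    .join("&");

  return "https://docs.google.com/spreadsheets/d/" + spreadsheetId + "/export?" + query;
}
```

In the script above, the call UrlFetchApp.fetch(buildSheetPdfUrl(ss.getId(), 0), ...) with the same Authorization header would then fetch only the sheet whose ID is 0.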

Convert Full Google Sheet to a PDF File

The above snippet converts individual sheets into separate PDF files, but there’s also a way to convert the entire spreadsheet into a single PDF file. In that case, replace the gid= parameter with id=SS_ID (the spreadsheet ID) or perform the conversion using DriveApp as shown here.

/* Email Google Spreadsheet as PDF */
function emailGoogleSpreadsheetAsPDF() {
  
  // Send the PDF of the spreadsheet to this email address
  var email = "amit@labnol.org"; 
  
  // Get the currently active spreadsheet URL (link)
  var ss = SpreadsheetApp.getActiveSpreadsheet();

  // Subject of email message
  var subject = "PDF generated from spreadsheet " + ss.getName(); 

  // Email Body can  be HTML too 
  var body = "Sent via Email Spreadsheet";
  
  var blob = DriveApp.getFileById(ss.getId()).getAs("application/pdf");
  
  blob.setName(ss.getName() + ".pdf");
  
  // If allowed to send emails, send the email with the PDF attachment
  if (MailApp.getRemainingDailyQuota() > 0) 
    GmailApp.sendEmail(email, subject, body, {
      htmlBody: body,
      attachments:[blob]     
    });  
}

Count Number of Pages in PDF with Google Script

The Google Documents API doesn’t offer a method to get the number of pages in a document. Dave Lam has an interesting workaround that helps you count the pages not only in a Google Document but also in any PDF file in your Google Drive.

The idea is to convert the Google Document into PDF and then parse the content of the PDF document as a text file using Regular Expressions.


/* Credit: https://code.google.com/p/google-apps-script-issues/issues/detail?id=1656 */

function countPDFPages() {

  var blob = DocumentApp.getActiveDocument().getAs("application/pdf");
  var data = blob.getDataAsString();

  var re = /Pages\/Count (\d+)/g;
  var match;
  var pages = 0;

  while (match = re.exec(data)) {

    // Always pass the radix so the count is parsed as a decimal number
    var value = parseInt(match[1], 10);

    if (value > pages) {
      pages = value;
    }

  }

  Logger.log("Total PDF pages = " + pages);

  return pages;
}
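
To apply the same idea to a PDF file stored in Google Drive, the matching logic can be separated from the document lookup. The sketch below reuses the regular expression from the script above; the function name countPagesInPdfData and the PDF_FILE_ID placeholder are hypothetical, and the approach only works when the page count appears uncompressed in the PDF data.

```javascript
// Hypothetical helper: returns the largest page count found in the raw PDF data
function countPagesInPdfData(data) {
  var re = /Pages\/Count (\d+)/g;
  var pages = 0;
  var match;

  // A PDF may contain several page-tree nodes; keep the largest count
  while ((match = re.exec(data)) !== null) {
    pages = Math.max(pages, parseInt(match[1], 10));
  }

  return pages;
}

// In Apps Script (PDF_FILE_ID is a placeholder for the Drive file ID):
// var data = DriveApp.getFileById(PDF_FILE_ID).getBlob().getDataAsString();
// Logger.log(countPagesInPdfData(data));
```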

Convert Image to PDF with Google Script

The Google Script fetches an image file, converts it to PDF and sends the PDF as an email attachment. The image file can be in your Google Drive or on the web (specify the image URL as the filename).

function convertImageToPDF(filename) {
  
  var image;
  
  // Is it a local file or web URL?
  if (filename.match(/^https?:\/\//i)) {
    image = UrlFetchApp.fetch(filename);
  } else {
    image = DriveApp.getFilesByName(filename).next();
  } 
  
  // grab its bytes and base64-encode them.
  var imageBlob = image.getBlob();
  var base64 = Utilities.base64Encode(imageBlob.getBytes());

  // embed the image in an HTML page as a data URI
  var html = '<img src="data:' + imageBlob.getContentType() + ';base64,' + base64 + '" />';
  
  // create a blob, convert to PDF
  var blob = Utilities.newBlob(html, MimeType.HTML).setName(filename + ".pdf");
  
  MailApp.sendEmail("ctrlq@labnol.org", "Image to PDF", "", {
    attachments: [blob.getAs(MimeType.PDF)]
  });
  
}

Convert PDF Files to Text with OCR

Google Drive can extract text from regular PDF files as well as scanned PDFs through OCR. Stack Overflow member Mogsdad has written a wrapper utility for Apps Script that converts PDF files to Google Documents and can perform OCR as well. You do need to enable the Advanced Drive Service from the Google Dashboard.


 /* See gist: https://gist.github.com/mogsdad/e6795e438615d252584f */

  var blob = DriveApp.getFileById(PDF_FILE_ID).getBlob();
  var text = pdfToText(blob, {ocrLanguage: "en"});
  Logger.log(text);


/**
 * Convert pdf file (blob) to a text file on Drive, using built-in OCR.
 * By default, the text file will be placed in the root folder, with the same
 * name as source pdf (but extension 'txt'). Options:
 */

function pdfToText ( pdfFile, options ) {
  // Ensure Advanced Drive Service is enabled
  try {
    Drive.Files.list();
  }
  catch (e) {
    throw new Error( "Enable 'Drive API' in Resources - Advanced Google Services." );
  }
  
  
  // Prepare resource object for file creation
  var parents = [];
  var pdfName = pdfFile.getName();

  var resource = {
    title: pdfName,
    mimeType: pdfFile.getContentType(),
    parents: parents
  };
  
  // Save PDF as GDOC
  resource.title = pdfName.replace(/pdf$/, 'gdoc');
  var insertOpts = {
    ocr: true,
    // Default to English when no options object or language is passed
    ocrLanguage: (options && options.ocrLanguage) || 'en'
  };

  var gdocFile = Drive.Files.insert(resource, pdfFile, insertOpts);
  
  // Get text from GDOC  
  var gdocDoc = DocumentApp.openById(gdocFile.id);
  var text = gdocDoc.getBody().getText();
  
  // Save the extracted text as a .txt file on Drive
  resource.title = pdfName.replace(/pdf$/, 'txt');
  resource.mimeType = MimeType.PLAIN_TEXT;

  var textBlob = Utilities.newBlob(text, MimeType.PLAIN_TEXT, resource.title);
  var textFile = Drive.Files.insert(resource, textBlob);
  
  return text;
}

Convert HTML to PDF with Apps Script

With Google Script, you can convert any HTML content into a PDF file with a few lines of code. The PDF file can be saved to Google Drive or you can email the file as an attachment using the GmailApp service.

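function htmlToPDF() {
  // Sample HTML (the tags were lost from the original listing; h1 and p are assumed)
  var html = "<h1>Hello world</h1>" +
    "<p>The quick brown fox jumped over the lazy dog</p>";

  // Create an HTML blob and convert it to PDF
  var blob = Utilities.newBlob(html, "text/html", "text.html");
  var pdf = blob.getAs("application/pdf");

  // Save the PDF to Google Drive
  DriveApp.createFile(pdf).setName("text.pdf");

  // Email the PDF as an attachment
  MailApp.sendEmail("email@labnol.org", "PDF File", "", {
    htmlBody: html,
    attachments: [pdf]
  });
}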
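
When the HTML is assembled from data such as spreadsheet values, it helps to escape the text before converting it. The following is only a sketch: escapeHtml and buildReportHtml are hypothetical helper names, and the heading-plus-paragraphs layout is an arbitrary example.

```javascript
// Hypothetical helper: escapes characters that have a special meaning in HTML
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: builds a heading followed by one paragraph per item
function buildReportHtml(title, paragraphs) {
  var body = paragraphs
    .map(function (paragraph) {
      return "<p>" + escapeHtml(paragraph) + "</p>";
    })
    .join("");
  return "<h1>" + escapeHtml(title) + "</h1>" + body;
}

// In Apps Script, the result can be converted in the same way as above:
// var html = buildReportHtml("Sales", ["Q1 < Q2"]);
// var pdf = Utilities.newBlob(html, "text/html", "report.html").getAs("application/pdf");
```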