PDF Import and Export

The Portable Document Format (PDF) is mainly used to exchange documents between systems without changing their formatting. A PDF document retains its formatting and layout when it is printed or displayed on different machines or operating systems. TX Text Control can import PDF files as text and export all supported formats to PDF and PDF/A.

Importing Text from a PDF File

PDF, like its close relative, PostScript, is a page description format. Its original purpose was to display documents on different platforms, preserving the layout and formatting to the greatest possible detail. Loading a PDF file back into a word processor, except for making very small changes, was never planned.

A PDF file therefore contains detailed information about the appearance of text characters, but not necessarily about their meaning. In other words, it specifies exactly what a character should look like and where on a page it is to be positioned, but not which ANSI or Unicode character it actually is. Because of that, it is not always possible to extract text from a PDF file.

In addition, a PDF file carries no information about text order or text flow, or about whether a piece of text is a header or a table cell. Although recent enhancements to the PDF specification allow this type of information to be included, it is rarely used. Fortunately, the majority of PDF files contain some form of character mapping, which enables a PDF reader to convert the contained text to a Unicode string.
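
The role of the character mapping can be illustrated with a small Python sketch. It is not part of TX Text Control; the function name, the dictionary-based mapping and the use of U+FFFD for unmapped glyphs are assumptions made for illustration only.

```python
def decode_glyphs(codes, to_unicode):
    """Translate glyph codes into text using a font's character mapping.

    to_unicode is a hypothetical dict from glyph code to Unicode string.
    Codes without an entry cannot be recovered and become U+FFFD.
    """
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)
```

When the mapping is incomplete or missing, the glyphs can still be drawn correctly on the page, but the text behind them cannot be recovered.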

TX Text Control extracts and converts all of the text it can find, adds missing spaces and paragraph breaks, and re-sorts the various text blocks so that they appear in their logical order.
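
As a rough illustration of what restoring logical order involves, the following Python sketch groups positioned text fragments into lines, orders them left to right, and inserts a space wherever the horizontal gap between neighbors is wide. It is a simplified stand-in, not TX Text Control's actual algorithm. The Fragment type, the tolerance values and the coordinate convention (y grows downward) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical positioned piece of text, as a PDF page might describe it."""
    x: float      # left edge
    y: float      # baseline, measured downward from the top of the page
    width: float
    text: str


def to_lines(fragments, line_tolerance=2.0, space_gap=1.5):
    """Group fragments into lines by baseline, then join them left to right."""
    lines = []  # each entry: [baseline_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])

    result = []
    for _, frags in lines:
        frags.sort(key=lambda f: f.x)
        text = frags[0].text
        for prev, cur in zip(frags, frags[1:]):
            # A wide horizontal gap suggests a space that the PDF never stored.
            if cur.x - (prev.x + prev.width) > space_gap:
                text += " "
            text += cur.text
        result.append(text)
    return result
```

Real documents also need column detection and paragraph recognition, which this sketch deliberately leaves out.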

The resulting text will be formatted according to the setting of the LoadSettings.PDFImportSettings property.

Features of TX Text Control's PDF import:

  • Text can be imported from PDF and PDF/A files, and saved in any of the formats supported by TX Text Control
  • Logical text order and missing spaces are restored
  • Text formatting, including font names, sizes and styles
  • Unicode support
  • Adobe Acrobat(R) is not required

Loading Encrypted PDF Files

A PDF may be encrypted, and may optionally be protected by a user password and an owner password. TX Text Control handles encryption internally, but the application is responsible for acquiring the respective passwords from the user and for restricting access to certain operations, such as printing or editing, if the passwords are not specified or do not match.

1. User Password

The user password is required to open the document. A FilterException with the Reason property set to FilterError.InvalidPassword is thrown when attempting to load a PDF that is protected with a user password without supplying the correct password. If this exception occurs, the application should ask the user for the password and retry loading the PDF with the password specified in LoadSettings.UserPassword.
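
The ask-and-retry flow described above can be sketched in a language-neutral way. The Python below does not call TX Text Control; FilterException and INVALID_PASSWORD are local stand-ins for the exception and reason named in the text, and the load and ask_password callbacks, as well as the max_prompts limit, are hypothetical.

```python
class FilterException(Exception):
    """Stand-in for the exception described above; carries a reason string."""

    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason


INVALID_PASSWORD = "InvalidPassword"  # mirrors FilterError.InvalidPassword


def load_with_password_prompt(load, ask_password, max_prompts=3):
    """Call load(password), prompting again whenever the password is rejected.

    load stands in for loading the PDF with LoadSettings.UserPassword set.
    ask_password stands in for the application's password dialog and
    returns None if the user cancels. Both are hypothetical callbacks.
    """
    password = ""  # first attempt without a password
    for attempt in range(max_prompts + 1):
        try:
            return load(password)
        except FilterException as exc:
            if exc.reason != INVALID_PASSWORD:
                raise  # some other filter error, not a password problem
            if attempt == max_prompts:
                return None  # give up after the allowed number of prompts
            password = ask_password()
            if password is None:
                return None  # user cancelled
    return None
```

Limiting the number of prompts is a design choice of this sketch; an application may equally keep asking until the user cancels.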

Once the PDF has been loaded, the LoadSettings.DocumentAccessPermissions flags specify which operations are allowed.

End-user applications should disable all operations that are not allowed, for instance by graying out the Print menu item if DocumentAccessPermissions.AllowLowLevelPrinting is not included.
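
Checking permission flags before enabling a menu item might look like the following Python sketch. The AccessPermissions type is a local model, not the TX Text Control enumeration: only the printing and allow-all names come from the text above, while ALLOW_GENERAL_EDITING and the menu item names are assumptions.

```python
from enum import Flag, auto


class AccessPermissions(Flag):
    """Hypothetical subset of permission flags, modelled on the names above."""
    NONE = 0
    ALLOW_LOW_LEVEL_PRINTING = auto()
    ALLOW_GENERAL_EDITING = auto()  # assumed flag name, for illustration
    ALLOW_ALL = ALLOW_LOW_LEVEL_PRINTING | ALLOW_GENERAL_EDITING


def menu_state(permissions):
    """Return which menu items should be enabled for the given permissions."""
    return {
        "Print": AccessPermissions.ALLOW_LOW_LEVEL_PRINTING in permissions,
        "Edit": AccessPermissions.ALLOW_GENERAL_EDITING in permissions,
    }
```

An application would evaluate such a function once after loading and gray out every item whose entry is false.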

2. Owner Password

The owner password is required to change a PDF file's security settings. Without the correct owner password, a PDF file can be displayed, but editing or printing may be prohibited. If the correct owner password is specified in LoadSettings.MasterPassword, DocumentAccessPermissions.AllowAll will be set after the file has been loaded, and no restrictions apply.

3. Permissions

Access permissions are described in LoadSettings.DocumentAccessPermissions.

4. Encryption Algorithms

PDF files can be encrypted with a variety of encryption algorithms. The following are supported:

  • Acrobat 3.0 and later 40-bit RC4
  • Acrobat 5.0 and later 128-bit RC4
  • Acrobat 6.0 and later 128-bit RC4
  • Acrobat 7.0 and later 128-bit AES

The following are not supported:

  • Acrobat 9.0 and later 256-bit AES
  • Certificates

A FilterException with the Reason property set to FilterError.Encrypted is thrown if one of the unsupported algorithms is encountered.

Exporting Documents as PDF Files

TX Text Control can save a document in two different PDF versions:

  • Standard PDF version 1.3, compatible with Acrobat 4.x and higher
  • PDF/A-1b, an ISO standard

PDF/A is a format created for the long-term archiving of electronic documents. The standard requires that a PDF file be 100% self-contained, i.e. all the information necessary to display the content must be embedded in the file. Audio and video content, JavaScript, and any other executable code are forbidden. Encryption is not allowed either.

The PDF/A standard is based on the PDF Reference Version 1.4 from Adobe Systems Inc. and was published as an ISO standard on October 1, 2005.

TX Text Control's PDF filter can export documents to PDF/A-1b, so that documents can be archived in compliance with ISO standards.

Using PDF/A in TX Text Control

To export any content in TX Text Control to a PDF/A-1b file, call the Save method with the StreamType.AdobePDFA enumeration value. To export the page size of a document to PDF, TextControl.ViewMode must be set to PageView.

Basic Steps to Create a PDF/A Document in an Application

1. The ViewMode property has to be set to PageView.

2. FontSettings.EmbeddableFontsOnly has to be set to true.

3. The document can now be saved to PDF/A with the TextControl.Save method.

The usage of fonts can be restricted with the properties FontSettings.EmbeddableFontsOnly, FontSettings.ScalableFontsOnly and FontSettings.TrueTypeFontsOnly. The first property determines whether only embeddable fonts can be used in a document, the second whether only freely scalable fonts can be used, and the third whether only TrueType (TTF) fonts can be used.

Please note that not every font can be embedded in a PDF/A file. There are two possible reasons: either the font's license forbids embedding, or a device-dependent font is used.

TX Text Control includes an advanced algorithm to replace any font that cannot be embedded. The AdaptFont event makes it possible to set the replacement font for a non-embeddable font programmatically. The event itself must be activated with the FontSettings.AdaptFontEvent property.

In the event, the AdaptFontEventArgs.FontName property represents the original font that cannot be embedded. The AdaptFontEventArgs.AdaptedFontName property is the name of the font that will be used to replace the original font, and SupportedFonts is an array of all supported fonts that can be used in a document.

The AdaptFont event is fired for each font that cannot be embedded, provided that FontSettings.AdaptFontEvent has previously been set to true.
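
The decision an AdaptFont handler has to make can be sketched as a plain function. The Python below is only a model of that logic, not TX Text Control code: the parameters mirror FontName, AdaptedFontName and SupportedFonts from the event arguments, while the function name and the preferences mapping are hypothetical.

```python
def adapt_font(font_name, suggested_name, supported_fonts, preferences):
    """Pick the replacement for a font that cannot be embedded.

    font_name, suggested_name and supported_fonts mirror FontName,
    AdaptedFontName and SupportedFonts from the event arguments.
    preferences is a hypothetical application-defined mapping from
    original font names to preferred replacements.
    """
    preferred = preferences.get(font_name)
    if preferred in supported_fonts:
        return preferred
    return suggested_name  # keep the replacement already proposed
```

In a real handler, the returned name would be assigned to AdaptedFontName. Falling back to the proposed replacement keeps the document exportable even when the application has no preference or its preferred font is not supported.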