Pro.XML — API for parsing XML documents

Overview

The Pro.XML module contains the API for parsing XML documents.

Extracting JavaScript from an XDP Document

XDP (XML Data Package) objects are XML documents contained in PDF documents that can contain JavaScript code and even embed other PDF documents.

The following code example demonstrates how to extract JavaScript code from an XDP document:

from Pro.Core import *
from Pro.XML import *

def parseXDP(fname):
    # Wrap the file in a container
    c = createContainerFromFile(fname)
    if c.isNull():
        return
    # Load the container as an XML object
    obj = XMLObject()
    if not obj.Load(c):
        return
    # Invoked for each embedded object found by IdentifyObjects()
    def callback(ud, type, opaque_identifier):
        if type == XMLObj_JavaScript:
            js = obj.GetObject(type, opaque_identifier)
            if js.isValid():
                # Print the script, skipping bytes that are not valid UTF-8
                print(js.read(0, js.size()).decode("utf-8", errors="ignore"))
        return True
    obj.SetIdentifiedObjectsCallback(callback, None)
    obj.IdentifyObjects()

Module API

Pro.XML module API.

Attributes:

XMLObj_File

Represents a file object embedded within the XML document.

XMLObj_JavaScript

Represents a JavaScript object embedded within the XML document.

XMLObj_VBS

Represents a VBScript object embedded within the XML document.

Classes:

XMLObject()

Represents an XML document object and provides methods to parse and manipulate XML content.

XMLParseHelper()

Helper class for parsing XML documents.

XMLObj_File: Final[int]

Represents a file object embedded within the XML document.

See also XMLObject.GetObject().

XMLObj_JavaScript: Final[int]

Represents a JavaScript object embedded within the XML document.

See also XMLObject.GetObject().

XMLObj_VBS: Final[int]

Represents a VBScript object embedded within the XML document.

See also XMLObject.GetObject().
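
When logging or reporting identified objects, a readable label is often more convenient than the raw integer constant. The following is a minimal sketch that relies only on the three constants documented above. The names build_type_labels and label_for, as well as the label strings, are hypothetical and not part of the SDK. The module is passed in as a parameter so the helper makes no assumption about the constants' actual values.

```python
def build_type_labels(xml_module):
    """Map each XMLObj_* constant value to a human-readable label.

    xml_module is expected to be the imported Pro.XML module. The label
    strings are illustrative choices, not part of the SDK.
    """
    return {
        xml_module.XMLObj_File: "file",
        xml_module.XMLObj_JavaScript: "JavaScript",
        xml_module.XMLObj_VBS: "VBScript",
    }


def label_for(labels, obj_type):
    """Return the label for obj_type, or a generic fallback for unknown values."""
    return labels.get(obj_type, "unknown (%d)" % obj_type)
```

Inside Cerbero Suite this would presumably be called as labels = build_type_labels(Pro.XML) after importing Pro.XML.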

class XMLObject

Bases: Pro.Core.CFFObject

Represents an XML document object and provides methods to parse and manipulate XML content.

See also XMLParseHelper.

Methods:

GetObject(type, opaque_identifier)

Retrieves an embedded object from the XML document.

GetXML()

Retrieves the XML document associated with this object.

GetXMLHelper()

Retrieves the XML parse helper associated with this object.

IdentifyObjects()

Identifies embedded objects within the XML document.

IdentifyXMLFromElementName(name)

Identifies the type of XML content based on the element name.

SetIdentifiedObjectsCallback(cb, ud)

Sets a callback function to be invoked when embedded objects are identified.

SetXML(xml)

Sets the XML document for this object.

SetXMLHelper(h_or_name)

Sets the XML parse helper for this object.

GetObject(type: int, opaque_identifier: bytes) → Pro.Core.NTContainer

Retrieves an embedded object from the XML document.

Parameters
  • type (int) – The type of the object to retrieve (e.g., XMLObj_File).

  • opaque_identifier (bytes) – An opaque identifier used to locate the object within the XML structure.

Returns

Returns the requested object encapsulated in an NTContainer.

Return type

NTContainer

See also XMLObj_File, XMLObj_JavaScript and XMLObj_VBS.
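
A small wrapper can make GetObject() easier to use when only the raw bytes are needed. The sketch below mirrors the isValid(), size() and read() calls used in the overview example. The function name read_embedded_object is hypothetical, and treating an invalid container as "object not available" is an assumption based on that example rather than a documented guarantee.

```python
def read_embedded_object(obj, obj_type, opaque_identifier):
    """Return the raw bytes of an embedded object, or None if unavailable.

    obj is expected to be a loaded XMLObject, obj_type one of the XMLObj_*
    constants, and opaque_identifier the value received by the callback
    set with SetIdentifiedObjectsCallback().
    """
    c = obj.GetObject(obj_type, opaque_identifier)
    # Assumption: an invalid container means the object could not be retrieved
    if not c.isValid():
        return None
    # Read the whole container, from offset 0 to its full size
    return c.read(0, c.size())
```

The caller decides how to interpret the bytes, for example by decoding script objects as text or by writing file objects to disk.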

GetXML() → Pro.Core.NTXml

Retrieves the XML document associated with this object.

Returns

Returns the XML content as a Pro.Core.NTXml object.

Return type

NTXml

See also SetXML().

GetXMLHelper() → Pro.XML.XMLParseHelper

Retrieves the XML parse helper associated with this object.

Returns

Returns the XML parse helper.

Return type

XMLParseHelper

See also SetXMLHelper().

IdentifyObjects() → None

Identifies embedded objects within the XML document.

This method processes the XML content to locate and identify any embedded objects, such as files or scripts.

See also GetObject().

static IdentifyXMLFromElementName(name: str) → str

Identifies the type of XML content based on the element name.

Parameters

name (str) – The name of the XML element.

Returns

Returns a string representing the identified type of XML content.

Return type

str

SetIdentifiedObjectsCallback(cb: object, ud: object) → None

Sets a callback function to be invoked when embedded objects are identified.

Parameters
  • cb (object) – The callback function to be called.

  • ud (object) – User-defined data to be passed to the callback.

See also IdentifyObjects().
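
Instead of acting on each object inside the callback, it can be convenient to record what was identified and process it afterwards. The following is a minimal sketch of that pattern. The callback signature (ud, type, opaque_identifier) and the True return value are taken from the overview example. The function name collect_identified_objects and its wanted_types parameter are hypothetical.

```python
def collect_identified_objects(obj, wanted_types=None):
    """Run identification and return a list of (type, opaque_identifier) pairs.

    obj is expected to be a loaded XMLObject. If wanted_types is given,
    only objects whose type is in that collection are kept.
    """
    found = []

    def callback(ud, obj_type, opaque_identifier):
        if wanted_types is None or obj_type in wanted_types:
            found.append((obj_type, opaque_identifier))
        # The overview example returns True from its callback; do the same here
        return True

    obj.SetIdentifiedObjectsCallback(callback, None)
    obj.IdentifyObjects()
    return found
```

Each collected pair can later be passed to GetObject() to retrieve the corresponding data.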

SetXML(xml: Pro.Core.NTXml) → None

Sets the XML document for this object.

Parameters

xml (NTXml) – The XML content to associate with this object.

See also GetXML().

SetXMLHelper(h_or_name: Union[Pro.XML.XMLParseHelper, str]) → None

Sets the XML parse helper for this object.

Parameters

h_or_name (Union[XMLParseHelper, str]) – The parse helper to set or the name of the helper.

See also GetXMLHelper().

class XMLParseHelper

Helper class for parsing XML documents.

Provides methods to handle embedded objects and assist in the parsing process.

See also XMLObject.

Methods:

GetObject(type, opaque_identifier)

Retrieves an embedded object during the parsing process.

IdentifyObjects()

Identifies embedded objects within the XML content during parsing.

GetObject(type: int, opaque_identifier: bytes) → Pro.Core.NTContainer

Retrieves an embedded object during the parsing process.

Parameters
  • type (int) – The type of the object to retrieve (e.g., XMLObj_File).

  • opaque_identifier (bytes) – An identifier used to locate the object.

Returns

Returns the embedded object as a Pro.Core.NTContainer.

Return type

NTContainer

See also IdentifyObjects().

IdentifyObjects() → None

Identifies embedded objects within the XML content during parsing.

This method scans the XML structure to locate embedded files or scripts.

See also GetObject().