XML import/export

From Qubit Toolkit


Command-line import of XML files

Release 1.2 introduced a command-line tool for importing single XML files or multiple XML files in a single directory. Running the import from the command-line bypasses web server and client timeout limits and allows the import to run for hours, or even days, without interruption.

The following is the output from the "help" page for the import:bulk task:

 symfony import:bulk [--application="..."] [--env="..."]
 [--noindex[="..."]] [--schema[="..."]] [--output[="..."]]
 [--v[="..."]] folder

  folder         The import folder or file

  --application  The application name (default: qubit)
  --env          The environment (default: cli)
  --noindex      Set to 'true' to skip indexing on imported objects
  --schema       Schema to use if importing a CSV file
  --output       Filename to output results in CSV format
  --v            Verbose output

  Bulk import multiple XML/CSV files at once

The only required argument is the "folder", which despite its name can point to a single file. As of Release 1.2, the import:bulk task supports import of the EAD (archival descriptions), EAC-CPF (authority records) and SKOS (subject thesauri) XML schemas.

The following sections provide background to the development of XML import. These sections are aimed at developers interested in understanding the technical design process and reading discussions led by Qubit developers.

Code

XML is the primary standard that will be used to exchange data between Qubit and other systems, including other instances/deployments of Qubit.

The import code is contained in:

  • lib/model/QubitXmlImport.class.php

See the QubitXmlImport documentation for details on the class methods.

Since XML import/export may be used for all types of Qubit objects, this feature is developed in the core Qubit Object module, as all the primary Qubit data objects are extended from QubitObject (see Data model diagram).

The import/export actions and templates are supported by a series of data-mapping templates. Users access these via the Admin -> Import/Export menu option.

Export

  • Qubit exports XML by default, but can export in any textual format via XSL.
  • Valid export formats are specified in a select array in importexportSuccess.php.
  • Export mappings are either: a) a template-based XML file for the given schema, or b) a YAML file pointing to a given XML template and a series of optional XSL transformations.
    • An example of a straight XML export is the ead.xml file; an example of an XSL-based export is marc.yml (in apps/qubit/lib/export), which begins by pointing at the XML template:
 templateXML: ead.xml
    • Note that in the latter, data is exported into the EAD2002 XML template, and then transformed into MARC21XML for export.
    • This same approach could be used to export any text format, e.g. CSV or LaTeX. An export defaults to the YAML file for a schema if it exists (e.g. marc.yml), and to the XML template if it does not (e.g. marc.xml).
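The YAML-over-XML fallback described above amounts to a file-existence check. A minimal sketch in Python (the real logic lives in Qubit's PHP export code; the function name and signature here are invented for illustration):

```python
from pathlib import Path

def resolve_export_mapping(schema, export_dir="apps/qubit/lib/export"):
    """Return the mapping file to use for a schema: prefer an XSL-driven
    YAML definition (e.g. marc.yml), fall back to the plain XML template
    (e.g. marc.xml). Illustrative stand-in, not the actual Qubit code."""
    yml = Path(export_dir) / f"{schema}.yml"
    if yml.exists():
        return yml
    return Path(export_dir) / f"{schema}.xml"
```

For example, with marc.yml present and no ead.yml, `resolve_export_mapping("marc")` selects the YAML definition while `resolve_export_mapping("ead")` falls back to ead.xml.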

Using Symfony templating + components system for XML export

  • Another option is to use the Symfony templating and components system to export XML data (e.g. as Symfony does with XML feeds).
  • This may be an alternative or complementary method to the more comprehensive XML import/export framework that has already been created.
  • There is a simple working proof of concept for this method that exports OAI Dublin Core metadata into the OAI <record> element contents as part of the OAI getRecord verb; see http://code.google.com/p/qubit-toolkit/source/detail?r=1722

Bulk Export via Command Line Interface (CLI)

In AtoM 1.4, the ability to conduct bulk EAD exports via the CLI has been added, in anticipation of the future development of a graphical user interface (GUI) to support such tasks for end-users. Users who are comfortable with use of the command line, and who wish to export multiple archival descriptions via the CLI, can do so with the following steps.

To make use of the CLI Bulk Export:

  • After opening a command-line console, navigate to the directory where AtoM is installed on your server or computer.
    • Use CLI commands such as cd to change directories, ls to list the contents of the current directory, and pwd to print the path of the current directory.
  • CLI commands for AtoM go through the Symfony application framework (v. 1.4). To run commands through Symfony, begin them with: php symfony
  • Create a directory (folder) into which the EAD files will be exported, e.g. mkdir myeadfolder, where "myeadfolder" is the name of the target export directory you would like to use.
  • Use the command: php symfony export:bulk myeadfolder to export AtoM's entire database, where "myeadfolder" is the name of the target export directory you have created in the steps above.
    • NOTE that bulk export of all fonds is not currently recommended without a job scheduler to manage the work, and will require consideration of PHP execution limits, etc. to ensure that the request can be completed. Job scheduling is being included in a future release of AtoM: see: https://projects.artefactual.com/issues/239
  • To pick specific archival descriptions / information objects for export, raw SQL can be added to the task to tell it specifically which descriptions should be exported.
    • Example: To export 3 fonds titled "779 King Street, Fredericton deeds," "1991 Canada Winter Games fonds," and "A history of Kincardine," you can issue the command: sudo php symfony export:bulk --criteria="i18n.title in ('779 King Street, Fredericton deeds', '1991 Canada Winter Games fonds', 'A history of Kincardine')" myeadfolder
  • You can add further archival descriptions of any level of description to the query by adding a comma and another title in quotes within the parentheses.

Import

  • Import mappings are written in a YAML file (in apps/qubit/lib/modules/object/import) as crosswalks to Qubit objects, e.g.:
   information_object:
     XPath:  "//archdesc | //c"
     Object: InformationObject
     Parent: "ancestor::archdesc | ancestor::c"
     Methods:
       title:
         XPath:  "did/unittitle"
         Method: setTitle
         Parameters: [$nodeValue]
       # ...
  • In this example, "information_object" is the name of the mapping, and when the XPath expression is matched, it will generate a QubitInformationObject for every node in the result set. Qubit can import objects of any type in its internal data model.
  • If the "Parent" XPath expression is set, Qubit will link the new object to the last matching element in the result set.
    • Note: XPath expressions are forward-axis by default; reverse-axis XPath expressions should be wrapped in parentheses () to force them to forward-axis. [We may be able to force this in code -MJS Jan.13]
  • It will then apply, in order, each of the XPath expressions listed in the "Methods" array, and execute the corresponding method on the object (see below).
    • Note: Both "Parent" and "XPath" elements are evaluated relative to the current XML node that is being traversed in the hierarchy. Absolute XPath expressions can be achieved using the / root element in XPath syntax.
  • If an XPath expression matches more than one node, then the method is executed once for each match. This can be limited within the XPath expression itself (eg. by using the (...)[1] syntax).
  • Parameters to the method can optionally be specified; if no Parameter is defined in the trigger, it defaults to $nodeValue, the string value of the node matched in the XPath expression.
  • There is also a $nodeXML parameter that can be used, which contains the full XML of the node and all descendant nodes.
  • Other examples of parameters include:
Parameters: [$null = null]
Parameters: $date = now()
Parameters: [QubitTerm::ARCHIVAL_MATERIAL_ID]
Parameters: [$name = 'information_object_language', $nodeValue]
Parameters: [$nodeValue, "$options = array('scope' => 'languages', 'sourceCulture' => true)"]
  • Parameters can also include a % character followed by an XPath expression relative to the current XML node being traversed. If an XPath expression matches a node-set, the results will be turned into an array. For example:
Parameters: [$nodeValue, %@normal]
Parameters: [%/root/item/name]
Parameters: ["%creator[@type='author']/fullname", $date = '02-07-2009']
Parameters: [$nodeXML, %@id, %//item/@type, "$var = array('value' => $nodeValue, 'time' => time())", QubitName::TEST_NAME]
  • Parameters can also include non-static methods from the current object by referencing $this->currentObject. For example:
Parameters: [$nodeValue, $this->currentObject->getId()]
  • Mappings are executed in order, from top to bottom in the mapping file.
  • This example would create a new QubitInformationObject for each <archdesc> and <c> element that exists in the XML file, and would link each to its parent <archdesc> or <c> element in the document hierarchy. It would then call the setTitle() method, with the $nodeValue parameter containing the string value of any elements at did/unittitle, relative to the current <archdesc> or <c> element.
  • Import mappings can also execute a series of optional XSL transformations before parsing the XML; an example of this is when an EAD1.0 file is detected, it is first transformed into EAD2002, and then the standard EAD2002 mapping is used.
  • Import formats are automatically detected by looking in a number of useful places in the incoming XML document, in particular the document type and XSI schema declarations. These are matched against a list of valid import formats located in importAction.class.php, which point to the appropriate YAML mapping to use for the import.
  • Anything not defined there will not be detected, and thus is not importable by Qubit.
    • This was considered core functionality that only developers should edit, so these values are not exposed as a user-configurable setting.
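The trigger loop described in the bullets above can be sketched as follows. This is a hypothetical, simplified model: the class and function names are illustrative, not the actual QubitXmlImport API, and Python's ElementTree supports only a subset of XPath (no unions or reverse axes), so the sample expressions are simpler than the real mappings.

```python
import xml.etree.ElementTree as ET

# Toy mapping: which nodes create new objects, and which relative XPath
# triggers call which method on each new object.
MAPPING = {
    "xpath": ".//c",
    "methods": [
        {"xpath": "did/unittitle", "method": "set_title"},
    ],
}

class InformationObject:
    """Stand-in for QubitInformationObject."""
    def __init__(self):
        self.title = None

    def set_title(self, node_value):
        self.title = node_value

def run_import(xml_text, mapping):
    root = ET.fromstring(xml_text)
    objects = []
    for node in root.findall(mapping["xpath"]):
        obj = InformationObject()
        for trigger in mapping["methods"]:
            # Each trigger XPath is evaluated relative to the current node;
            # the method runs once per match, with the string value of the
            # matched node as the default parameter (like $nodeValue).
            for match in node.findall(trigger["xpath"]):
                getattr(obj, trigger["method"])(match.text)
        objects.append(obj)
    return objects

ead = "<ead><dsc><c><did><unittitle>Series 1</unittitle></did></c></dsc></ead>"
titles = [o.title for o in run_import(ead, MAPPING)]
```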

Multi-level import

  • Some XML documents that Qubit must import include nested, hierarchical elements, each of which requires the instantiation of a separate Qubit object that must be linked in the correct place within the hierarchy tree.
    • the most obvious example is EAD documents that contain hierarchical data within the <dsc> element
      • the <dsc> element can contain 1..* <c> elements, each of which can be nested in other <c> elements to create a hierarchical tree structure (optionally an institution can also use the <c01>, <c02>...<c12> convention to create a maximum of 12 levels)
  • The current YAML syntax supports multi-level object import, as well as cross-linking between imported objects, provided they are imported in order, i.e. InformationObjects must be imported before DigitalObjects for them to be linked properly; ensuring the YAML mappings are in the correct order is the easiest way to achieve this.
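The nested <c> case can be illustrated with a small recursive sketch: one object is instantiated per <c> element and linked to its parent, mirroring the ordering requirement above. The dict here is only a stand-in for a Qubit information object.

```python
import xml.etree.ElementTree as ET

def build_tree(c_elem, parent=None):
    """Recursively turn nested <c> elements into parent-linked nodes."""
    node = {
        "title": c_elem.findtext("did/unittitle"),
        "parent": parent["title"] if parent else None,
        "children": [],
    }
    for child in c_elem.findall("c"):  # direct child components only
        node["children"].append(build_tree(child, node))
    return node

ead = ("<dsc><c><did><unittitle>Fonds</unittitle></did>"
       "<c><did><unittitle>Series A</unittitle></did></c></c></dsc>")
tree = build_tree(ET.fromstring(ead).find("c"))
```

Because each child is created after (and linked to) its parent, the hierarchy is preserved no matter how deeply the <c> elements nest.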

Importing large data sets

Disable the search index

The easiest way to reduce the resources required for the import is to disable Search indexing during the import.

To disable the search index:

  • Apply the patch: patch -p0 < disable_search_index.patch
  • Import your XML data
  • Re-enable search indexing by reversing the patch: patch -p0 -R < disable_search_index.patch

PHP settings

You may also need to increase the PHP script limits in your php.ini file to allow the import to run to completion.

Several PHP settings can cause a long-running import to fail, most commonly memory_limit and max_execution_time.
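For example, the following php.ini overrides are a starting point for long-running imports (the values are illustrative; tune them for the size of your data set):

```ini
; Illustrative values only; adjust for the size of the import
memory_limit = 512M        ; large DOM trees consume substantial memory
max_execution_time = 0     ; remove the script time limit
```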

Crosswalk recommendations

  • XML Import/Export raises questions about the design of the Qubit core data model:

Qubit is still a very ICA-centric (ISAD/ISAAR) application, which is heavily reflected in the internal metadata elements. This means that a crosswalk has to be written from the Qubit schema TO AND FROM every single import or export schema. That is a *HUGE* amount of difficult and tedious work for even the most experienced information specialist. I realize that we're stuck hard-coded to the current ICA-centric data model for the foreseeable future, but I would suggest that a quick and effective way around this limitation would be to write some good-quality Qubit-to-MARC21XML and MARC21XML-to-Qubit mappings. The reason for this is that there are a group of well-defined MARC21XML crosswalks provided by LOC (in apps/qubit/lib/xslt), which would basically allow us to import and export to/from any of the various XML formats defined. By writing only 2 crosswalks, we get over 20 import/export formats by leveraging all of the hard work done over the past 5-10 years or so.

Crosswalks inherently have a risk of being lossy, but as long as we have an internal data model that is hardcoded, there will always be loss in every import and export. This is one of the major struggles in trying to provide an application that can satisfy all the disparate information sciences. I'm not a fan of MARC by any means, but the reality is, it gives us the most bang-for-the-development-buck, and you've got to start somewhere. When (and specifically, not "if") it comes time to refactor the internal metadata schema, I'd strongly propose a very minimal set of core elements -- basically those that map cleanly to (almost?) every schema, so probably something like the DCMES 15 that are hardcoded, and the rest dynamic MARC-based fields, with special additions as necessary for specific mappings.
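The pivot idea above can be illustrated as crosswalk composition: two hand-written Qubit-to-MARC mappings make every existing MARC-to-X crosswalk reachable. The field mappings below are toy values, not real MARC crosswalk entries.

```python
# Toy crosswalks expressed as source-field -> target-field dicts.
qubit_to_marc = {"title": "245$a", "creator": "100$a", "extent": "300$a"}
marc_to_mods = {"245$a": "titleInfo/title", "100$a": "name/namePart"}

def compose(first, second):
    """Chain two crosswalks via the pivot format; fields the pivot mapping
    cannot carry are dropped, which is where the lossiness comes in."""
    return {src: second[mid] for src, mid in first.items() if mid in second}

qubit_to_mods = compose(qubit_to_marc, marc_to_mods)
```

Note how "extent" is silently lost because the toy MARC-to-MODS mapping has no entry for it; this is the loss the paragraph above warns about.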

  • The current Qubit core data model for information resources is based on ICA-ISAD(G) and Dublin Core. That is to say there is a direct 1:1 mapping between each ISAD or Dublin Core element to one of the elements in the Qubit Information Object (see information_object and information_object_i18n in the schema.yml file).
  • The current approach is to treat the Qubit core data model as an intermediate metadata format, like the DSpace project has done:
    • "This format is designed to be a straightforward and precise representation of the DSpace data model's Item metadata. It cannot be overemphasized that this is strictly an internal metadata format. It must never be recorded, transmitted, or exposed outside of DSpace, because it is NOT any sort of standard metadata. We must not allow it to "escape" to prevent its being mistaken for an actual supported and sanctioned metadata format. It exists to support internal transformations only."
  • I think what MJ is suggesting is that we switch over to using MARC as our core data model, because it will serve as a better intermediate data model, considering that the Library of Congress maintains a number of MARC XSLT crosswalks to other standards. (Clarified Dec 2008: the idea is only to use the XSLT mapping feature of the import/export framework to apply the LOC MARC crosswalks in the short term, if/as necessary.)
  • However, the intermediate metadata format approach, regardless of whether we use the current Qubit core, MARC, or something else, would either require us to update the Qubit core metadata every time we add support for another standard that includes elements not already in the core, or leave us with a lossy system that simply ignores elements not already in the Qubit core
    • neither is a desirable option
  • therefore we have created QubitProperties which are key:value pairs that we can add to any core Qubit object to store data values not already supported by the core data model
    • we recently used this to add Rules for Archival Description support
  • however, this does raise an export issue:
    • if we have an export mapping based on the Qubit core, it will miss data values that are potentially stored in QubitProperties
      • for example, if I created an informationObject using a RAD template and then use the Qubit->EAD export, it may miss some of the RAD fields that the user has entered
    • one way to address this is to create more explicit export mappings that indicate both the source and the target metadata format, eg. ISAD-to-EAD, RAD-to-EAD
      • this would require us to track which encoding format was used to create or import the information object (e.g. the source) so that we can automatically select the correct source-to-target crosswalk upon export
    • another way is to combine the ISAD-to-EAD and RAD-to-EAD export logic in one EAD export logic which is sensitive to both ISAD and RAD properties
    • one way to facilitate this would be to have an XML dump for a QubitObject that includes its core elements as well as any additional Properties linked to that object (grouped by Property 'scope' and 'name'). Then a XSLT source-to-target stylesheet could be run on the dumped XML document, looking for any applicable Qubit core and property values as required by the mapping.
      • three downsides to this approach:
        • like the DSpace format, the dumped XML "must never be recorded, transmitted, or exposed outside" the application, because it is not any sort of standard metadata
        • it adds an additional step between the PHP objects and the target XML, which may increase the burden of maintenance
        • unless PHP is clearly not the tool for the job, using a different language adds complexity to the project, making deployment, integration, and accessibility to new developers more difficult

Speed optimization

Testing by MJ and David revealed that building the Zend Search Lucene search index for imported information objects is consuming a large portion of the system resources on import (kcachegrind tests show in excess of 60% of total script execution time). Optimizing the search index build should therefore return large performance gains.

David did a quick test importing an EAD XML file with 128 archival descriptions ("Alexander Cameron Rutherford Fonds") with various levels of indexing.

  • Test 1: QubitInformationObject and all related objects run SearchIndex::updateTranslatedLanguages() - Time: 98 seconds
  • Test 2: only QubitInformationObject runs SearchIndex::updateTranslatedLanguages() (x2) - Time: 42 seconds
  • Test 3: No update of search index - Time: 4 seconds

Remove double ->save() calls in the XmlImport library

On each iteration through the DOM tree, each object is saved twice, once to obtain a primary key from the database and the second time to save the object with all appropriate data.

  • Pros: This should cut calls to the search index build in half
  • Cons: It requires some major revision to the import engine to delay insertion of foreign key related objects until after the initial save of the primary object.

TODO

  • Digital Object save - defer save until last possible moment
    • For ::create() method - remove save of digital object, just build object and return it to allow parent call to determine proper save time
    • Defer creation of Digital Object on filesystem, path information and derivatives until save called (insert into save method)
    • This will allow adding Digital object to Info save object stack and remove double save.

Remove multiple builds of QubitInformationObject document

Currently the system is re-adding the QubitInformationObject to the index after several related objects are saved, including: QubitNote, QubitEvent, QubitObjectTermRelation, etc.

Possible methods for removing redundant calls to SearchIndex::updateTranslatedLanguages():

  1. Add a flag to the ->save() method for each class that would disable indexing (e.g. ->save($options = array('skipIndexing' => true)))
  2. Add a flag (and method) to each class that would toggle indexing on save of the object (e.g. $qubitEvent->noIndex(true))
  3. Add a new save method to each class that would avoid the indexing action (e.g. ->saveNoIndex())
  4. Build index only once - when saving QubitInformationObject (parent) object

Option #4 would require one of the following:

  • Implement Option 1-3 (or some other modification of the current "save()" behaviour) for the "child" classes (QubitEvent, QubitNote, QubitObjectTermRelation, etc.)
  • Remove search index update altogether when saving related objects (QubitEvent, QubitNote, etc.)
  • Breaking the index into multiple document types (currently just storing information objects) so that update of e.g. QubitNote $bar belonging to QubitInformationObject $foo only triggers adding a index document for $bar and not for $foo.
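Option #1 above can be sketched as follows; the class names and flag are illustrative stand-ins, not the actual Qubit/Propel API.

```python
class SearchIndex:
    """Toy index: just records which objects have been indexed."""
    def __init__(self):
        self.documents = []

    def update(self, obj):
        self.documents.append(obj)

class QubitRecord:
    """Stand-in for a Qubit model class with an indexing-aware save()."""
    def __init__(self, index):
        self.index = index

    def save(self, skip_indexing=False):
        # ... persist the row to the database here ...
        if not skip_indexing:
            self.index.update(self)

index = SearchIndex()
record = QubitRecord(index)
record.save(skip_indexing=True)  # bulk-import path: no index write
record.save()                    # normal path: document indexed once
```

The same toggle could equally live on the class (option #2) or in a dedicated method (option #3); the trade-off is where the caller has to remember to use it.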

Discussion on June 10th

(11am - 12pm approx.)

  • Participants: Jack, Peter, David

Decisions

  • Don't break up the search index document into multiple documents, because of the speed implications of writing to the index (or indexes) multiple times per information object
  • Jack to investigate Option #4 - adding related objects to info object and doing save of all objects from info object and then update index.
    • Implication: will break update of search index when related objects are updated (because index update is done solely from info object save)
  • If Option #4 not doable within a week then go with Option #1 or #2 - should be fast to implement

To investigate/implement later:

  • Create separate search index or separate document within current index for each Qubit object type
    • make faceted search possible?
    • may critically impact import performance due to the large number of separate index writes (similar to current speed problems)
    • does address outstanding requirements to enable search on other objects (issue 165), (issue 416), (issue 767)
  • Create design page for Propel vs. Doctrine
    • document current limitations of Propel
      • known issue: SQL call for moving menus via nested set are not going through ORM (direct MySQL calls)
      • can't build object hierarchy in memory then save to database at last possible moment (at end of request)
    • Doctrine - may allow storing data from current request in ORM until last minute, then save to DB to pass data to new request

Import/Export errors [Updated to ICA-AtoM 1.3 / 2012]

ISAD(G) EAD round-trip export/import errors

Field | Result | In export file? | Comments
Identifier | missing attributes for repository and country, esp. at lower levels of description | yes | See issue #3987
Start year | missing | yes | fixed
Parent level | see comments | n/a | Parent level appears only if it is part of a child-level description that was imported along with its parent level. So if you import a series, the series-level description will be missing its parent level, but any child-level descriptions belonging to the imported series will include it.

pv: How do we know what the parent level is supposed to be if it is not indicated anywhere in the import file? I assume what is implied here is that Qubit recognizes the series upon import as one that already exists in the database, it then updates the existing series description with any changes/updates in the import file, while maintaining the link to its parent level. If so, this capability can be called 'merging' and is linked to a broader 'versioning' capability which we will begin to develop in release 1.1 (see issue #912)

Script of material | missing | no | pv: not sure which EAD element or attribute to use for this. There is a @scriptcode attribute for the <language> element, but that means the script and the language have to be related. That is currently not how we are entering them in Qubit: the lists of 'language of material' and 'script of material' values can be populated independently of each other. Any suggestions?

Two suggestions from Giovanni Michetti, UBC School of Library, Archival and Information studies professor:

<langmaterial><language scriptcode="" langcode="">LANGUAGE GOES HERE</language>GENERAL NOTES GO HERE</langmaterial>

OR (less elegant but allowed in EAD):

<langmaterial><language>LANGUAGE GOES HERE</language><language scriptcode="">SCRIPT GOES HERE</language>GENERAL NOTES GO HERE </langmaterial>

SCRIPT CODE will be taken from: ISO 15924:2004

Physical characteristics and technical requirements | misplaced | yes | fixed; was merged with the Extent and medium field and needs to be imported separately into the <phystech> field
Existence and location of copies | missing | yes | fixed
Publication notes | missing | yes | fixed
Notes | missing | yes | fixed
Place access points | missing | yes | fixed
Name access points | additional name access point created | yes | fixed
Archivist's notes | missing | yes | pv: in release 1.0.8, mapped to the author tag in the <eadheader> section using the value for Archivist's Note from the highest level being exported. Need to figure out how to map back in.

dg: <processinfo> tag supports the TYPE attribute, as well as the ENCODINGANALOG attribute. Suggest employing these to solve import errors - encodinganalog to map it to the proper field (so it is not confused with dates of creation, revision, etc.), as well as the TYPE attribute to ensure separation and add a human readable label to the EAD - e.g., <processinfo type="archnote">. See also notes below under dates of creation, revision, deletion.

Description record identifier | missing | no | pv: which EAD element to map to? Remember this field does not yet exist in ISAD(G); it was added to the control area in anticipation of the ICA updating all control areas to match their later standards format. <eadid> is using the Qubit identifier, which is better from a technical point of view, as we are then able to offer a resolvable URI for the finding aid.

dg: This field belongs to ISDIAH rather than ISAD(G), and refers to identifying "the description of the institution with archival holdings uniquely within the context in which it will be used" (ISDIAH 5.6.1). As such, it currently bears no direct equivalent in EAD and should not normally be expected to be exported/imported using EAD. However, if we do wish to preserve user-input data in this field (as removing it from the AtoM archival description template risks losing user-input data), one possibility might be to simply use the <odd> tag with a specific ENCODINGANALOG attribute to map it to ISDIAH instead of ISAD(G), an ID attribute to link it to any data input in the Description identifier field of a linked repository from the ISDIAH template from Archival institutions, and a TYPE attribute to specify its function (e.g. type="descid")?

See also: issue 2321, issue 912, and issue #2668 for other proposed solutions and workarounds

Institution identifier | missing | no | pv: which EAD element to map to? Remember this field does not yet exist in ISAD(G); it was added to the control area in anticipation of the ICA updating all control areas to match their later standards format. Right now we are using the name of the linked repository as the <publisher> value in the <eadheader>

dg: This field belongs to ISDIAH rather than ISAD(G), and refers to identifying "the agency(ies) responsible for the description." As such, it currently bears no direct equivalent in EAD and should not normally be expected to be exported/imported using EAD. However, if we want to include it in an EAD export (since it cannot be removed from the AtoM archival description template now without risking the loss of user data), one workaround may be to use either the <odd> or possibly even the <repository> tag, with an ENCODINGANALOG attribute that will map it to ISAAR(CPF) instead of ISAD(G), and an ID attribute that will link it to an existing authority record in AtoM to populate the field?

See also: issue #2668 for other proposed solutions and workarounds

Rules or conventions | missing | yes | pv: the <descrules> element is in the <eadheader>; right now we are only mapping elements within the <archdesc> tag. Need to resolve in a later release. See issue #1013
Status | missing | no | pv: which EAD element to map to? Remember this field does not yet exist in ISAD(G); it was added to the control area in anticipation of the ICA updating all control areas to match their later standards format.
Level of detail | missing | no | pv: which EAD element to map to? Remember this field does not yet exist in ISAD(G); it was added to the control area in anticipation of the ICA updating all control areas to match their later standards format.
Dates of creation, revision and deletion | misplaced | yes | exported as <processinfo> but placed in the Archivist's notes field

pv: Archivist's note field from the highest level is used for the <author> element in the <eadheader>. For the lower levels it continues to be combined with 'Dates of creation, revision and deletion' in the <processinfo> element. See r3256. Need to figure out how to map these back properly for roundtripping

dg: Issue seems to be on import, not export - possibly because it is using the same EAD encoding as the archivist's notes (<processinfo>), with no attributes. Suggestions include:

  1. Adding a TYPE attribute to the process info tag to differentiate between its use for Archivist's notes and Date
  2. The <dates> tag may contain PCDATA - therefore, Dates of creation etc. should be wrapped as such: <processinfo><date>, while the Archivist's note can remain <processinfo>
  3. <date> may also carry a TYPE attribute. For better data management, suggest making this a repeatable field with a radio button for creation, revision, deletion. Depending on the radio button selection, the field is encoded as <processinfo><date type="creation"> (or "revision", "deletion"). If the user does not select a radio button, the field is simply mapped as type="".

See also: issue #2668 for other proposed solutions and workarounds

Languages of archival description | missing | no | pv: as we can only export an EAD document in one language at a time (with any source-culture fallback values where necessary), the <langusage> element in the <eadheader> only reflects the language in which the EAD file is exported, along with the fallback value. Any suggestions which element to use to indicate that other translations of the archival description are available?

dg: <language> is a repeatable field when nested inside <langusage> in EAD, and the EAD Tag Library notes that "for bilingual or multilingual finding aids, either identify each language in a separate <language> element, or mention only the predominant language." We should be able to set the export EAD to note all instances of a single finding aid in the EAD, even if this information is not displayed on the show screen at the time of export (as only one language of description is displayed at a time). <language> attributes such as ID (a linking attribute) can be used to maintain a relationship with the other languages entered; it is possible that the AUDIENCE attribute could be used to set the exported language as external and the rest as internal, so on import AtoM will know which of the <language> tags to display in the Languages of archival description field - if other languages are then also imported/linked, this audience value can be changed to external, making them visible in the description showscreen once more.

See also: issue #3250, issue #2668

Scripts of archival description | missing | no | pv: unlike 'script of material', it should be possible to tie the 'scripts of archival description' to a specific language element as its attribute value (<langusage><language scriptcode="">), because the value of the scriptcode attribute equals the script in which the EAD content is output. Entered as a 1.1 issue #913; see also issue #1300, issue #2668
Sources | missing | no | pv: which EAD element to map to? Remember this field does not yet exist in ISAD(G); it was added to the control area in anticipation of the ICA updating all control areas to match their later standards format.

dg: This field comes either from ISAAR(CPF), where it refers to sources consulted in establishing an authority record, NOT an archival description (ISAAR 5.4.8), or from ISDIAH, where it refers to sources consulted in the description of an archival institution (ISDIAH 5.6.8). Typically ISAAR data is encoded in EAC, and users should not expect this data to export in an EAD description. However, now that it has been deployed in AtoM in a manner that leads users to expect it to be captured on export, it cannot be removed from the AtoM description template without risking user data loss. Therefore, one possible solution is to encode it similarly to the Archivist's notes field (which ISAD(G) describes in terms similar to ISAAR's 5.4.8: "record notes on sources consulted in preparing the description..."). The tag <processinfo> with a TYPE attribute (where type="source") could be used to map this data to its source field, if we expect it to be contained within an EAD export.

Location (Physical Storage) missing yes see issue 643, issue 1901, issue 1966 and related discussions.

Suggestions from Giovanni Michetti, professor at the UBC School of Library, Archival and Information Studies:

The EAD Tag Library advises the use of the LABEL attribute for specifying the type of storage, but this is an improper use of the LABEL attribute; use the TYPE attribute instead. When encoding for export, <container> should be used for the actual container and controlled by the TYPE attribute: i.e. box, hollinger box, cardboard box, folder, reel, etc. <physloc> should only be used for locations, such as shelves, map cabinets, etc. The suggestion is to rework our user interface to better match the encoding by altering the dropdown list in light of this encoding proposal and removing physloc types, such as shelves and map cabinets, from the container type list. (If desired, these and other related options can be added to a physloc dropdown field beside the free-text field, as <physloc> also supports the TYPE attribute.) Ideally, the encoding of these fields should nest the <container> tag within the <physloc> tag to keep them associated, as below:
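A sketch of the proposed nesting (the element nesting follows the proposal above; the type and content values are illustrative examples, not a fixed vocabulary):

```xml
<physloc type="shelf">Shelf 14, Bay 2
  <container type="box">Box 7</container>
</physloc>
```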


[edit] Dublin Core to EAD round-trip export/import errors

* jb: updated chart 12/12/2012

{| class="wikitable"
! Field !! Result !! In export file? !! Comments !! In import file? !! Comments
|-
| Creator || problem on import || yes || || no || Imported file has "Untitled" in Creator field
|-
| Publisher/Contributor || problem on import || yes || || no || Date of publication and date of contribution present BUT NO NAMES
|-
| Place of Creation/Publication/Contribution || problem on import || yes || || no ||
|-
| Type || missing || no || || no ||
|-
| Relation - parent level || misplaced || yes || placed in Relation field; link disappears and is replaced by quotation marks (e.g. is part of "Test fonds") <c><did><unittitle encodinganalog="3.1.2">XXXX</unittitle><unitid encodinganalog="3.1.1">XXXX</unitid></did></c> || ? || My test data is a collection with a series of photographs. Is this what is meant by Relation - parent level?
|-
| Relation || see comments || yes || missing if Relation - parent level field is present, because it is overwritten by data from that field || ?? || (see comment made about Relation - parent level)
|-
| Coverage (spatial) || places entered directly into the coverage (spatial) field are present || yes || <geogname>XXXX</geogname> || yes || It is confusing because places entered in the AtoM actor dialog are shown as hyperlinks in the coverage (spatial) area of the view archival description screen, but upon import these places are missing due to a glitch with places related to an actor.
|-
| Name (Physical Storage) || || yes || <physloc><container type="xxxx">XXXX</container></physloc> || ||
|-
| Location (Physical Storage) || || yes || <physloc><container type="xxxx">XXXX</container></physloc> || ||
|-
| Container type || || yes || <physloc><container type="xxxx">XXXX</container></physloc> || ||
|}

Errors encountered upon importing a DC.EAD file created with AtoM 1.3. See: Warning Box

[edit] Dublin Core to DC.XML round-trip export/import errors

Unable to test round-trip IMPORT errors with Dublin Core due to Issue 4302: DC.XML Import not working in 1.3 ICA-AtoM - JBushey 2012-12-11

{| class="wikitable"
! Field !! Result !! In export file? !! Comments
|-
| Language || present || see comments || on export this element is NOT coded using ISO 639-1 (2-letter language codes); see Issue 4451
|-
| Rights holder || missing || no || this DC field is not included in the ICA-AtoM template or XML export
|-
| Provenance || missing || no || this DC field is not included in the ICA-AtoM template or XML export
|-
| Instructional method || missing || no || this DC field is not included in the ICA-AtoM template or XML export
|-
| Accrual method || missing || no || this DC field is not included in the ICA-AtoM template or XML export
|-
| Accrual periodicity || missing || no || this DC field is not included in the ICA-AtoM template or XML export
|-
| Accrual policy || missing || no || this DC field is not included in the ICA-AtoM template or XML export
|-
| Audience || missing || no || this DC field is not included in the ICA-AtoM template or XML export
|}
All other fields included in the AtoM DC template are properly exported as of version 1.3.

[edit] RAD to EAD round-trip export/import errors

pv: set RAD->EAD export mapping using the CCA crosswalk. jb: This table has been updated with testing done in ICA-AtoM 1.3, November 2012.

{| class="wikitable"
! Field !! Result !! In export file? !! Comments
|-
| Parallel title || present || yes ||
|-
| Other title information || present || yes ||
|-
| Statements of responsibility (title) || present || yes ||
|-
| Title notes || present || yes || <odd type="titlesource"><p>XXXX</p></odd> (this is the source of title proper) and <odd type=" "><p>XXXX</p></odd> (this is for other notes, e.g. conservation etc.)
|-
| Parent level || see comments || || Parent level appears only if it is part of a child-level description import. For example, a fonds with series and items must be imported as the entire fonds to include all levels. If the series is imported without the fonds-level EAD, only the series and items will be available; the fonds level will be missing and there is no indication in the EAD that there is a parent-level description. Is this related to the discussion in Issue 912 regarding versioning?
|-
| Edition Statements || present || yes ||
|-
| Edition Statement of Responsibility || present || yes ||
|-
| Statement of scale (cartographic) || present || yes ||
|-
| Statement of projection (cartographic) || present || yes ||
|-
| Statement of coordinates (cartographic) || present || yes ||
|-
| Statement of scale (architectural) || present || yes ||
|-
| Issuing jurisdiction and denomination (philatelic) || present || yes ||
|-
| Title proper of publisher's series || present || yes ||
|-
| Parallel title of publisher's series || present || yes ||
|-
| Other title information of publisher's series || present || yes ||
|-
| Statement of responsibility of publisher's series || present || yes ||
|-
| Numbering within publisher's series || present || yes ||
|-
| Note on publisher's series || present || yes ||
|-
| Dates of creation || see comment || yes || Single event broken into two events; link between creator and event is broken. See issue #1737 (still occurring in 1.3)
|-
| Event type || see comment || yes || Default event type is <corpname> but should be <name> until the authority record is finalized and TYPE is populated with either corporate, person, or family. See issue 4418
|-
| Event note || see comment || missing || Event note is not being captured in EAD export. For a suggested fix, see Issue 4419 and RedmineIssue 3686
|-
| Physical description || multiple entries not separated into separate <extent> fields || yes, but all combined into one field || On import, if there are multiple <physdesc> fields or multiple sub-elements such as <extent>, only the last field is imported. See the discussion thread at http://groups.google.com/group/ica-atom-users/browse_thread/thread/f8eaf16eedb0aaf3?hl=en (Retested in 1.3 and all fields are being imported, but with no separation regardless of format. You can use carriage returns or EAD tags in the physical location data entry field of the ICA-AtoM template, but once they are exported everything is lumped into a single <physdesc><extent> tag. Upon import all physical descriptions, extents, etc. are concatenated without punctuation or delineation.)
|-
| Physical condition note || see comments || yes || merged with Physical description field; in the XML file <phystech> is a sub-element of <physdesc>. <b>(Is this still true if we use <physdesc><extent encodinganalog="3.1.5"> for physical description and <phystech encodinganalog="3.4.3"><p> for physical condition in the 1.3 export?)</b>
|-
| Script of material || missing || no || See RedmineIssue 2863 and RedmineIssue 4508
|-
| Availability of other formats || present || yes ||
|-
| Other notes || present || yes ||
|-
| Standard number || present || yes ||
|-
| Name, Place and Subject access points || present || yes ||
|-
| Description record identifier || see comments || || RAD doesn't have a description identifier or reference code as part of the standard. AtoM does provide one in the edit template and in the EAD XML export. The header includes <eadid countrycode=" " url=" " encodinganalog="Identifier">XX</eadid> and the <archdesc> includes <unitid countrycode=" " encodinganalog="3.2.1">XXXX</unitid>. See Issue 912 and Issue 2321. In 1.3 this is inconsistently supported: in some cases the countrycode is included in the unitid element, in other cases it is missing, and in still other cases the repositorycode is inserted. See Issue 2039.
|-
| Institution identifier || present || yes || In the header we include the institution ID from ISDIAH and then include the archival description ID, all within the <eadid> element: <eadid mainagencycode="XXX" url="" encodinganalog="Identifier">XXXX</eadid>. In the <archdesc> we pull the institution ID from ISDIAH and then include the archival description ID, all within the <unitid> element: <unitid repositorycode="XXX" encodinganalog="3.1.1">XXXX</unitid>
|-
| Rules or conventions || present || yes || data exported as <descrules encodinganalog="3.7.2">
|-
| Status || present || yes ||
|-
| Level of detail || present || yes ||
|-
| Dates of creation, revision and deletion || present || yes || data exported as
|-
| Languages of archival description || imports incorrectly || see comments || See RedmineIssue 4424
|-
| Scripts of archival description || missing || no || See RedmineIssue 4508 and RedmineIssue 2863
|-
| Sources || present || yes ||
|-
| Location (Physical Storage) || see comments || yes/no || data exported as <physloc>xxx<container type="">xxxx. The physical container is being exported, but only when it is included in a multi-level description. The physical container for the fonds is not recorded in the EAD export, but the physical container for the sous-fonds (child level) is. Upon import the physical container for the sous-fonds is accurately represented, but not for the fonds level. See Issue 1901, Issue 1966, Issue 643.
|}

[edit] Notes

[edit] Release 1.0.8 Export Refactor

Release 1.0.8 introduced a change to the way that records are exported. This is now done directly from an individual object's show template (see the 'export' label in the context menu) to avoid having to load a resource-intensive tree of the entire object hierarchy to select the correct object to export.

Also, the structure of the export code was changed to use Symfony templating.

[edit] Release 1.1 Import Refactor

There will be a restructuring of the import code organization to make the ongoing development of this feature more flexible.

As well, there are a number of performance issues for import. These are largely related to the search index performance. The following changes are proposed for release 1.1:

  1. upload the XML file
  2. validate the XML
  3. parse the XML file and store each individual Qubit information object in the database as an XML blob in a QubitImportObject
  4. import each object, one at a time, i.e. run it through the import mapping and save it to the database, but only index the title & publication status. Also, only save one object per request, returning to the status bar after each object import. The request for the next object can be triggered manually, or AJAX can be used to automatically request the next import (this allows the import to be interrupted for whatever reason and picked up where it left off, and avoids having to put in one massive request that breaks php/apache timeout/size limits).
  5. batch update the search index (only the newly imported objects). Can be cron- or user-triggered?
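The staged design in steps 3-5 can be sketched as follows. This is a minimal illustration in Python, not AtoM's actual PHP implementation; all names (stage_import, import_next, etc.) are hypothetical, and real XML parsing and search indexing are stubbed out.

```python
def parse_objects(xml_text):
    """Stand-in for real XML parsing: one 'object' blob per line."""
    return [line for line in xml_text.splitlines() if line.strip()]

def stage_import(xml_text, queue):
    """Step 3: parse once, store each object as a raw blob to import later."""
    for blob in parse_objects(xml_text):
        queue.append({"xml": blob, "imported": False})

def import_next(queue, db):
    """Step 4: import a single object per request, so an interrupted run
    can resume where it left off. Full indexing is deferred."""
    for item in queue:
        if not item["imported"]:
            db.append(item["xml"])  # run import mapping and save (simplified)
            item["imported"] = True
            return item["xml"]
    return None  # nothing left to import

def batch_index(queue, index):
    """Step 5: batch-update the index with only the newly imported objects."""
    index.extend(i["xml"] for i in queue if i["imported"])

queue, db, index = [], [], []
stage_import("<c>one</c>\n<c>two</c>", queue)
while import_next(queue, db) is not None:
    pass  # each iteration stands in for one small web request
batch_index(queue, index)
```

The point of the one-object-per-request loop is that crashing or cancelling mid-run leaves the staged queue intact, so the next request simply picks up the first un-imported blob.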

User:David Is it worthwhile to build a cron job for doing a daily, incremental update of the index? This update would search for updated or new records based on the "updated_at" column of the q_object table, and store a "last update" value in the db or a file.
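Such an incremental update might select candidate records with a query along these lines (a sketch only; apart from the q_object table and updated_at column mentioned above, the details are assumptions):

```sql
-- Find objects created or updated since the last incremental index run.
-- :last_update is the stored "last update" timestamp.
SELECT id
FROM q_object
WHERE updated_at > :last_update
ORDER BY updated_at;
```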

[edit] Email thread: Jan 2011


> These are just some notes about how to break up the current import process to
> separate all the tasks that are currently happening in a single request. It
> still assumes a single file at a time. However, I think the principle might
> still be the same. We'll need to add the ability for the user to select multiple
> files via their browser file upload dialog. We can make that a single request
> and upload all the files to a tmp directory specifically for that import. Then
> run any number of additional processes in the background to work our way through
> importing each file.


Allowing the selection of multiple files through the browser upload dialog is
problematic because it is not supported by HTML or Javascript. We can use the
same Flash library that we are using for the digital object upload, but it has
proven difficult to debug and maintain, and isn't usable for users without
access to Flash (e.g. due to organizational security protocols). I'd much
prefer to run the batch upload via the command line, which allows specifying
any number of arbitrary files or directories, and which I think is a better
fit for a long-running process like batch upload in any case.

[edit] Post 1.1

User:David I think for post-1.1 we really need to re-evaluate Zend Lucene and consider: 1) what functionality we need from a full-text search index; 2) what other tools might provide this functionality (Solr, Java Lucene, Sphinx, MySQL full-text search); 3) what tools will give the best compromise between indexing speed and search speed.

From this forum thread it sounds like Java Lucene (as opposed to the PHP Zend version we are currently using) is no speed demon on the indexing side. The quoted speed is 6 hours for 3 million records(!) on a pretty unimpressive server (Dual 2GHz Xeon with 1GB RAM). Of course this may be fast enough for our needs (though this is difficult to determine because we don't have a definite limit on collection size). It just makes sense that Lucene is optimized for search speed, which means a hit on the indexing side - perhaps this will be true of all full-text search engines?

User:mj Indexing performance depends a lot on the implementation; ie. even with Lucene it can vary between Solr, ElasticSearch, etc, especially in how bulk indexing is handled, document size, etc. For a counter-example, have a look at the LAC testing page in the section entitled "Early NoSQL Prototype Background" (at the bottom). On a 2GHz Core 2 Duo with 2GB RAM, indexing 2.5 million records took about 15 minutes using ElasticSearch.

[edit] Resources
