Record Merging Algorithm - for MARC records

Pool selection (potential duplicates)

A pool of potential duplicates is obtained from the database by searching on the following data elements from the incoming record:

  • LCCN (010 a- normalized)
    • The LCCN has the following format:
      010 ##$a###79139101#/AC/MN
      The index should include only the numeric portion. Skip any leading non-numeric characters, and index to either the first blank or end of subfield.
  • ISBN (020 a - normalized)
    • The ISBN in MARC records has the following format:
      020 ##$a0394502884 (Random House) :$c$12.50
      020 a 074253779X (pbk. : alk. paper)
      020 a 9780742537798 (pbk. : alk. paper)
      The ISBN is at the beginning of the subfield, but other data can follow. Select the first string in the field, including the ending "X" if it is there. Because of the recent change to a 13-character ISBN, it is probably best to convert all incoming ISBNs to ISBN-13. The alternative is to allow an embedded match between the first 9 digits of the shorter ISBN and the longer one.
  • First 25 characters of the normalized title (245 a, b)
    • Normalization generally removes subfield coding, punctuation, and diacritics. Depending on how you have handled the conversion to Unicode, the removal of diacritics may not be possible.
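
The three search keys above can be sketched roughly as follows. This is only an illustration, not a fixed implementation: the function names are made up for the example, blanks are assumed to be real spaces (the "#" in the LCCN sample stands for a blank), lowercasing the title is an assumption, and the ISBN is not validated, only converted.

```python
import re
import unicodedata


def normalize_lccn(subfield_a: str) -> str:
    """Skip leading non-numeric characters, then keep up to the first blank."""
    match = re.match(r"\D*(\S*)", subfield_a)
    return match.group(1) if match else ""


def isbn13_check_digit(first12: str) -> str:
    """Standard ISBN-13 check digit: weights alternate 1, 3 over 12 digits."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def normalize_isbn(subfield_a: str) -> str:
    """Take the first string in the subfield and return it as an ISBN-13."""
    tokens = subfield_a.split()
    if not tokens:
        return ""
    isbn = tokens[0].replace("-", "").upper()
    if len(isbn) == 10 and isbn[:9].isdigit():
        body = "978" + isbn[:9]  # the old check digit (possibly "X") is dropped
        isbn = body + isbn13_check_digit(body)
    return isbn


def normalize_title(title: str) -> str:
    """Remove diacritics and punctuation; lowercase (assumption); squeeze blanks.

    Expects the 245 $a and $b values already joined, without subfield codes.
    """
    decomposed = unicodedata.normalize("NFKD", title)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    no_punct = re.sub(r"[^\w\s]", " ", no_marks)
    return " ".join(no_punct.lower().split())


def title_key(title: str, length: int = 25) -> str:
    """First 25 characters of the normalized title."""
    return normalize_title(title)[:length]
```

Converting everything to ISBN-13 means the two forms of the same ISBN in the examples above (074253779X and 9780742537798) end up with the same index value.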

We should add OCLC numbers to this search. They are a bit more complex because there is no standard place to store them in MARC records. An OCLC number can be identified by the use of "OCoLC" in conjunction with the number. The number may carry a prefix of "ocm" or "ocl7"; remove that prefix before searching and matching. OCLC numbers are most commonly found in two places:

  • In the 001/003 of the MARC record:
    001    62525112
    003    OCoLC
  • In an 035 field:
    035 __ |a (OCoLC)ocm51050179
    In this latter case, the OCLC number is identified by the (OCoLC) preceding the number.
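
A minimal sketch of pulling OCLC numbers from those two places. The function names and the way the field values are passed in (plain strings rather than a MARC library's record objects) are assumptions for the example; only the two prefixes named above are stripped.

```python
import re
from typing import Iterable, List, Optional

OCLC_PREFIXES = ("ocm", "ocl7")  # the two prefixes named in the text


def strip_oclc_prefix(number: str) -> str:
    """Drop a leading "ocm" or "ocl7" (case-insensitive) from the number."""
    number = number.strip()
    lowered = number.lower()
    for prefix in OCLC_PREFIXES:
        if lowered.startswith(prefix):
            return number[len(prefix):]
    return number


def oclc_numbers(field_001: Optional[str],
                 field_003: Optional[str],
                 fields_035a: Iterable[str]) -> List[str]:
    """Collect OCLC numbers from 001/003 and from any 035 $a values."""
    found = []
    # The 001 only counts as an OCLC number when the 003 says "OCoLC".
    if field_001 and field_003 and field_003.strip() == "OCoLC":
        found.append(strip_oclc_prefix(field_001))
    # In the 035, the number must be preceded by "(OCoLC)".
    for value in fields_035a:
        match = re.match(r"\s*\(OCoLC\)\s*(\S+)", value)
        if match:
            found.append(strip_oclc_prefix(match.group(1)))
    return list(dict.fromkeys(found))  # drop repeats, keep order
```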

When all of these searches have been done, you have the pool of records against which the merge algorithm will act. Depending on how searching is done, you may have the same record in the pool more than once. If possible, eliminate those duplicates (or design the search and pool so that they don't happen).
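
One simple way to keep repeats out of the pool is to key it on a record identifier as the search results come in. In this sketch the "id" key and the build_pool name are hypothetical; use whatever unique key the database provides.

```python
from typing import Dict, Iterable, List


def build_pool(result_sets: Iterable[Iterable[dict]]) -> List[dict]:
    """Merge the results of the separate searches, one entry per record id."""
    pool: Dict[str, dict] = {}
    for results in result_sets:
        for record in results:
            # setdefault keeps the first copy and ignores later repeats
            pool.setdefault(record["id"], record)
    return list(pool.values())
```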

Level 1 Merge

The merge takes place in two stages. The first level merge allows records to merge on a limited algorithm. This merge is only possible when both the incoming record and the pool record contain identifiers (LCCN or ISBN -- or OCLC number). A threshold weight is set for each format in the global parameters table. If a record receives a weight below the threshold, it is not considered a match and the record proceeds to Level 2. If the record receives a weight equal to or above the threshold, it is considered a match.

The threshold for monograph record merging is 875 points (weights are listed in Appendix A). The Level 1 merge is done on:

  • LCCN/ISBN/OCLC# [If more than one number is available, the higher points value is used.]
  • Publication date (from 008 pos. 07-10)
  • First 25 characters of the normalized title (245 a, b combined)
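
A rough sketch of the Level 1 weight, using the Appendix A values. Several points are interpretations rather than part of the algorithm as written: records are plain dicts with hypothetical keys ("lccn", "isbn", "date", "title_key") holding already-normalized values; only identifiers present in both records are compared; a non-numeric date is treated as missing; and OCLC numbers are left out because Appendix A lists no weight for them.

```python
from typing import Optional

# (match, present-in-both-but-no-match) weights from Appendix A
ID_WEIGHTS = {"lccn": (200, -320), "isbn": (85, -225)}


def identifier_weight(incoming: dict, candidate: dict) -> int:
    """Score each identifier present in both records; use the higher value."""
    scores = []
    for name, (match, mismatch) in ID_WEIGHTS.items():
        a, b = incoming.get(name), candidate.get(name)
        if a and b:
            scores.append(match if a == b else mismatch)
    return max(scores) if scores else 0


def date_weight(a: Optional[str], b: Optional[str]) -> int:
    """Exact match 200, within 2 years -25, otherwise -250, missing 0."""
    if not a or not b or not (a.isdigit() and b.isdigit()):
        return 0
    diff = abs(int(a) - int(b))
    if diff == 0:
        return 200
    return -25 if diff <= 2 else -250


def short_title_weight(a: str, b: str) -> int:
    """450 when the 25-character title keys are identical, otherwise 0."""
    return 450 if a and a == b else 0


def level1_weight(incoming: dict, candidate: dict) -> int:
    return (identifier_weight(incoming, candidate)
            + date_weight(incoming.get("date"), candidate.get("date"))
            + short_title_weight(incoming.get("title_key", ""),
                                 candidate.get("title_key", "")))
```

With the weights as listed in Appendix A, a matching LCCN, date and short title add up to 850, so it is worth checking the 875 threshold against the real parameters table.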

For efficiency, it is probably best to test each record in the pool against the Level 1 algorithm, stopping when a match has been found. This ignores multiple matches in the pool, but if the database is seeded with a non-duplicative source (e.g., LoC records) and each new source is matched against the database, the number of duplicate matches should be low (and may even be suspect).
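
The stop-at-first-match loop might look like the sketch below. The weighing function is passed in as a parameter so the example stands alone; first_level1_match and weigh are names invented for the illustration.

```python
from typing import Callable, Iterable, Optional


def first_level1_match(incoming: dict,
                       pool: Iterable[dict],
                       weigh: Callable[[dict, dict], int],
                       threshold: int = 875) -> Optional[dict]:
    """Return the first pool record whose weight reaches the threshold."""
    for candidate in pool:
        if weigh(incoming, candidate) >= threshold:
            return candidate  # stop here; later pool records are not tested
    return None  # no Level 1 match; the record goes on to Level 2
```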

Level 2 Merge

If a merge is not obtained at Level 1, then Level 2 steps are performed.

Full title match (245 a, b - normalized)
The title gets different values depending on how "perfect" the match is.

  • Exact match (whole string matches)
  • Embedded match (one title is embedded in the other, left-anchored)
  • Keyword match (a percentage of the keywords that match between the titles, with additional points for having the keywords in the same order in the title)
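
Here is one possible reading of the full-title comparison, using the Appendix A weights. The text leaves several details open, so these are assumptions: the "shorter than 9 characters" rule is checked first; the percentage is the number of shared keywords divided by the keyword count of the longer title; "non match" means no keywords in common; and titles are already normalized.

```python
def full_title_weight(a: str, b: str) -> int:
    """Weight for two normalized full titles (245 a, b)."""
    if len(a) < 9 or len(b) < 9:
        return 0      # too short to say anything
    if a == b:
        return 600    # exact match
    shorter, longer = sorted((a, b), key=len)
    if longer.startswith(shorter):
        return 350    # embedded, left-anchored
    a_words, b_words = a.split(), b.split()
    common = set(a_words) & set(b_words)
    if not common:
        return -600   # non match
    share = len(common) / max(len(set(a_words)), len(set(b_words)))
    weight = round(share * 450)
    # Bonus when the shared keywords appear in the same order in both titles.
    if [w for w in a_words if w in common] == [w for w in b_words if w in common]:
        weight += 50
    return weight
```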

Country of publication
This is an exact match between values from MARC 008 pos. 15-17.

Main Entry (100, 110, 111 - normalized)
Comparison of 1XX fields from the records. Since not all records have 1XX fields, there are default values assigned when one record has a 1XX and one doesn't, and when both are missing the 1XX.

  • Exact match
  • Keyword match (a percentage, useful mainly for 110 and 111 fields, which are corporate authors and conference names)
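
The main entry comparison, with its defaults for missing fields, could be sketched like this using the Appendix A values. The way the keyword percentage is computed (shared keywords over the keyword count of the longer entry) is an assumption, as is the function name.

```python
from typing import Optional


def main_entry_weight(a: Optional[str], b: Optional[str]) -> int:
    """Weight for two normalized 1XX values; None or "" means no 1XX field."""
    if not a and not b:
        return 75     # both records lack a 1XX
    if not a or not b:
        return -25    # only one record has a 1XX
    if a == b:
        return 125    # exact match
    a_words, b_words = a.split(), b.split()
    common = set(a_words) & set(b_words)
    share = len(common) / max(len(set(a_words)), len(set(b_words)))
    if share < 0.5:
        return -200   # fewer than half the keywords in common: non match
    weight = round(share * 80)
    # Bonus when the shared keywords appear in the same order.
    if [w for w in a_words if w in common] == [w for w in b_words if w in common]:
        weight += 10
    return weight
```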

Pagination
Pagination is derived from the 300 $a field using the highest number found in that string. Pagination values are only used if they are greater than 10.

Examples:

  300 $a viii, 235 p. :                    = 235
  300 $a xxvi, 468 p., [32] p. of plates   = 468
  300 $a 374 p. :                          = 374
  300 $a 4 v. in one box :                 = 4

  • Exact match
  • Match within +/- 10
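
Extracting the page count and weighing it might look like the sketch below. The weights follow Appendix A; returning 0 when either record has no number in the 300 $a, and treating a value of exactly 10 as "not greater than 10", are assumptions, since neither case is spelled out.

```python
import re
from typing import Optional


def page_count(subfield_a: str) -> Optional[int]:
    """Highest arabic number found in the 300 $a string, or None."""
    numbers = [int(n) for n in re.findall(r"\d+", subfield_a)]
    return max(numbers) if numbers else None


def pagination_weight(a: Optional[int], b: Optional[int]) -> int:
    if a is None or b is None:
        return 0  # assumption: a missing value neither helps nor hurts
    both_large = a > 10 and b > 10
    if a == b:
        return 100 if both_large else 50
    if abs(a - b) <= 10:
        return 50 if both_large else 20
    return -225
```

Roman numerals such as "viii" contain no digits, so they drop out without special handling.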

Publisher
Publisher names in the 260 b field are not highly normalized by library cataloging so this field is used only in rare cases where a match has not been attained up to this point.

  • Exact match
  • Embedded match

Weights
The table of weights for our program allows us to assign negative and positive weights, with up to 5 different positive weights (not counting zero).

Record Merge Algorithm for ONIX records

ONIX records appear to have only a small number of fields filled in, so we can assume that an incoming ONIX record will only use the Level 1 match algorithm. ONIX records in the database should not match each other, since each record represents a single publisher's edition of a work. ONIX records may match MARC records, and getting this match to work should be our goal.

Data Elements

ISBN

ONIX records will probably have both a 10-character and a 13-character ISBN. Here is an example:

    <productidentifier>
      <b221>02</b221>
      <b244>0002154129</b244>
    </productidentifier>
    <productidentifier>
      <b221>03</b221>
      <b244>9780002154123</b244>
    </productidentifier>

The field coded "b221" with a value of "02" is the 10-character ISBN. The field coded "b221" with a value of "03" is the 13-character ISBN (which is equivalent to an EAN). Depending on how the software is handling ISBNs, either or both of these need to be stored and indexed.
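
As an illustration, both ISBNs can be read from the short-tag ONIX shown above with the standard library XML parser. The enclosing "product" element and the onix_isbns name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# b221 code -> label; 02 and 03 are the two codes described above
ISBN_TYPES = {"02": "isbn10", "03": "isbn13"}

SAMPLE = """
<product>
  <productidentifier>
    <b221>02</b221>
    <b244>0002154129</b244>
  </productidentifier>
  <productidentifier>
    <b221>03</b221>
    <b244>9780002154123</b244>
  </productidentifier>
</product>
"""


def onix_isbns(product: ET.Element) -> dict:
    """Return the ISBN-10 and/or ISBN-13 found in the product identifiers."""
    found = {}
    for ident in product.iter("productidentifier"):
        id_type = ident.findtext("b221")
        value = ident.findtext("b244")
        if id_type in ISBN_TYPES and value:
            found[ISBN_TYPES[id_type]] = value.strip()
    return found


isbns = onix_isbns(ET.fromstring(SAMPLE))
```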

Title

The title is in the "title" field, labeled "b203". There are various kinds of titles that can be found in the ONIX standard. If there is more than one title, we should prefer the one with the "b202" code of "01", which means "Distinctive title".

    <title>
      <b202>01</b202>
      <b203>Wealth Protection Secrets of a Millionaire Real Estate Investor</b203>
    </title>

One "catch" is that the ONIX titles may begin with initial articles, like "The" and "A". It is common in the library world to index titles with the initial articles removed, following the indicator value in the title field. We either need to create an index of MARC titles with the initial articles left on, or remove the most common ones from the ONIX titles so we can retrieve them with a string match, or something else that I haven't thought of.
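
A sketch of the second option, together with the title preference described earlier: pick the "01" title when there are several, then drop a leading article. The article list is a hypothetical, English-only starting point, and the function names and the enclosing "product" element are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Optional

ARTICLES = ("the", "a", "an")  # hypothetical list; extend per language


def onix_title(product: ET.Element) -> Optional[str]:
    """Prefer the title with b202 = "01"; otherwise take the first title."""
    titles = product.findall("title")
    for title in titles:
        if title.findtext("b202") == "01":
            return title.findtext("b203")
    return titles[0].findtext("b203") if titles else None


def strip_initial_article(title: str) -> str:
    """Remove one leading article, unless it is the whole title."""
    words = title.split()
    if len(words) > 1 and words[0].lower() in ARTICLES:
        return " ".join(words[1:])
    return title
```

A fixed list like this will occasionally strip a word that is not really an article (a title beginning "A to Z", for example), which is the cost of not having the MARC indicator to rely on.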

Dates

The ONIX publication date is in YYYYMMDD format. The MARC coded date (from the 008) is YYYY. We should create an indexed form of the ONIX publication date with just the YYYY portion.

<b003>20060901</b003>
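
Deriving the indexed year is a one-liner; the sketch below only adds a guard for values that do not start with four digits. The onix_year name is made up for the example.

```python
from typing import Optional


def onix_year(pub_date: str) -> Optional[str]:
    """Return the YYYY portion of a YYYYMMDD date, or None if malformed."""
    date = pub_date.strip()
    return date[:4] if len(date) >= 4 and date[:4].isdigit() else None
```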

Merging with ONIX

Using these three data elements should give us enough that 1) incoming MARC records will retrieve ONIX records into their pool and possibly match with them and 2) incoming ONIX records will retrieve MARC records into their pool and possibly match. This theory will need to be tested.

Appendix A: Weights for Monographs
(The minimum weight required for merging is +875.)

Element         MARC field              Condition                                    Weight

LCCN            010 a                   Match on subfield a                             200
                                        Field present in both records but no match     -320
                                        Either record or both records missing             0

ISBN            020 a                   Match on subfield a                              85
                                        Field present in both records but no match     -225
                                        Either record or both records missing             0

Date            008 7-10                Exact match                                     200
                                        +/-2 years                                      -25
                                        No match                                       -250
                                        Value missing                                     0

Short-Title     245 a,b                 Exact match on first 25 characters              450
                                        Non match on first 25 characters                  0

Full-Title      245 a,b                 Exact match                                     600
                                        Either title contained within other title       350
                                        Either title shorter than 9 characters            0
                                        Non match                                      -600
                                        Matching keywords                                 *

Country of      008 15-17               Exact match                                      40
Publication                             Either one missing                                0
                                        Non match                                      -205

Pagination      300 a                   Match exactly and > 10                          100
                                        Match exactly and < 10                           50
                                        Match within 10 and both are > 10                50
                                        Match within 10 and either are < 10              20
                                        Non match (by more than 10)                    -225

Publisher       260 b                   Exact match                                     100
                                        Either missing                                    0
                                        Occur within the other                          100
                                        Non match                                       -25

Main Entry      100 a,b,c,d,k,q         Exact match                                     125
                110 a,b,c,d,k,n         Matching keywords                                **
                111 a,b,c,d,e,g,k,n,q   Field missing from one record                   -25
                                        Fields missing from both records                 75
                                        Non match                                      -200

* Calculate weight based on the percentage of full title keywords common to the incoming record and the database record (% in common) x 450. If keywords are in the same order then add 50.
** If half or more of the main entry keywords are in common, calculate weight based on the percentage of keywords in common x 80. If keywords are in the same order then add 10.