Archive for pdNickname

pdNickname 3.0 released today

Version 2.0 users can get a free upgrade—see below

This new offering is loaded with almost 400,000 given names and nicknames. This represents virtually every first name found in the United States, and a quarter of the database is international names only found outside the United States.

SEE PDNICKNAME 3.0 NOW >>

The package is also replete with information about the languages associated with the names and their origins, and it even has a popularity ranking for all first names appearing in the United States since 1915. It identifies more than 500 languages, dialects, and ethnic groups, from English and Spanish to Arabic and Swahili.

Key Features

  • Nicknames: the heart of the database is a huge collection of more than 200,000 nicknames and the names they are associated with, including short forms, abbreviations, diminutives, and even hypocoristics.
  • Name Variations: different ways of spelling given names and nicknames are identified based on linguistic and onomastic name tree research; some names can have more than a hundred variations.
  • Phonetic Matches: first names that are not true variations but sound similar or have close spellings are identified and rated on a 1 to 99 scale.
  • Languages: we have appended extensive demographic information about languages of usage.
  • Popularity: name popularity is ranked based on U.S. Census and U.S. Social Security records.
  • Special Origins: unique characteristics about the name origins are also provided, including their connections to religion, mythology, historical events, and literature.

A Pro edition also adds fuzzy logic, which allows matching when names are entered with typographical errors.

Finally, students, teachers, scholars, and those researching family histories benefit as well because the software is highly recommended for study in genealogy, onomatology, anthroponymy, ethnology, linguistics, and related disciplines.

pdNickname 3.0 is available for immediate download from our website. It can be purchased as a standalone product (Pro, $495; Standard, $299), or as part of bundles now on sale, including pdSuite Names ($595) and pdSuite Master Collection ($695).

SEE OUR SUITES NOW >>

SPECIAL UPGRADE OFFER FOR 2.0 USERS

If you have pdNickname 2.0, for a limited time you can get a free upgrade to the new version. This applies to standalone versions and suites that contain the product. This offer expires June 30, 2016.

UPGRADE NOW >>

The language of names

According to Barbara Adair, Peacock Data’s chief development coordinator and spokesperson, the language features in the company’s long-standing flagship software packages, pdNickname and pdGender, and their new pdSurname product, “have never been available before on this scale and required a sizable portion of nearly six years of research and development.”

With this landmark software, among other powerful features, users can identify the languages associated with tens of thousands of first names, nicknames, and last names, providing them with critical ethnic and heritage demographics about their clients.

pdNickname is an advanced name and nickname file. It identifies first names that are the same even when they are not an exact match, but rather equivalent, such as a variation or nickname.

pdGender is a gender coding database built on the same set of names as pdNickname. Users can match the data against the first names on their lists to determine male and female identification.

pdSurname is a new last name file. It identifies last names that are the same even when they are not an exact match, but rather equivalent, such as a variation or a similar sounding or spelled name.

All three innovative software packages embrace a host of similar and compatible features including languages of origin and use as well as fuzzy logic so information can be recognized even when names have misspellings or other typographical errors.

Combined, the three products cover more than 600 languages, dialects, ethnic groups, and races, such as English, Spanish, Portuguese, French, Italian, German, Polish, Russian, Chinese, Japanese, Vietnamese, Korean, Hindustani, Arabic, Persian, and Yiddish, as well as Native American names and ancient Greek, Latin, and Hebrew names.

Plans for the software releases were initially written up in January 2009 and development began in earnest mid-summer of that same year. They were built during parallel development cycles and began to be made available to the public with the release of pdNickname 2.0 and pdGender 2.0 on December 30, 2013. pdSurname, which has the largest database, was launched March 2, 2015.

According to Barbara Adair, “Creation of the master name file these new products result from is the biggest venture our company has ever undertaken. There are thousands of sources for names in scores of languages, and our task was to compare and contrast all this data and create the ultimate name resources.”

“From the start, it was essential to identify the languages and dialects associated with names in considerable detail,” she added. “This gives users previously unavailable ethnic demographics linked to the names already on their lists.”

Barbara Adair showed some of the documents employed in construction of the software offerings, including a manuscript from 731 AD, written by a monk named Bede, listing the earliest English names dating from the Anglo-Saxon era of the early medieval period. The still common personal name ‘Hilda’ is an example from the manuscript.

“Because sources often give diverse information and use different spelling conventions, it was crucial not only to gather all the data possible but also to differentiate between the quality of sources,” the spokesperson explained. “Better information became easier to identify after working with the sources over the course of the first year.”

Barbara Adair concluded, “These database packages are one-of-a-kind proprietary resources that let our users complete projects with significantly more success and in ways that were not possible before. They are very innovative pieces of software and we are encouraged and grateful for the response from our clients. A lot of time and hard work has been put into these efforts and it is a very exciting time for our customers and everyone here at Peacock Data.”

All three ground-breaking software packages are available for immediate download from the company’s website. They come with precision documentation, complete with examples, and perpetual multi-seat site licenses allowing installation on all computers in the same building within a single company or organization.

MORE ABOUT PDNICKNAME >>

MORE ABOUT PDGENDER >>

MORE ABOUT PDSURNAME >>

Optionally, they can also be licensed as part of the company’s pdSuite Names and pdSuite Master Collection software bundles.

About Peacock Data

California-based Peacock Data are the makers of database software products used by businesses, organizations, churches, schools, researchers, and government. They are an industry leader because of their superior solutions and renowned loyalty to customers.

For more than 20 years Peacock Data’s specialized software has been utilized in applications you use every day.

MORE ABOUT PEACOCK DATA >>

Affiliates program

DO YOU WANT TO SELL PEACOCK DATA PRODUCTS?

The firm’s affiliates program offers a unique way for your website or app to link to the Peacock Data product line. You will be provided with all of the tools necessary to convert your existing traffic into sales along with full support from dedicated affiliate managers. Apply now to join the program and earn substantial rewards!

Fuzzy logic generation 2.0

Peacock Data introduced the next generation of their fuzzy logic technology with last month’s release of the California-based firm’s pdSurname Pro last name matching software.

According to company spokesperson Barbara Adair, “pdSurname facilitates identifying last names that are true variations or phonetically similar, while the fuzzy logic technology in the enhanced Pro edition allows finding names even when there are misspellings or other typographical errors.”

“We introduced fuzzy logic with our pdNickname Pro and pdGender Pro software in late 2013, but the new fuzzy logic generation 2.0 is a great enhancement,” Barbara Adair exclaimed.

According to the company, most of the enhancements were achieved after they developed a giant library of more than 80,000 language rules based on hundreds of dialects from around the world. Barbara Adair said, “Many misspellings occur as transcribers enter the sounds they hear. The character sequences and the sounds they produce are different for each language and situation, such as before, after, or between certain vowels and consonants, so our substitutions are language-rule based.”

The company explained additionally that their algorithms go even further by considering both how a name may sound to someone who speaks English as well as how it may sound to someone who speaks Spanish, which is often different. Barbara Adair explained, “Take the letter-pair ‘SC’ as an example. Before the vowels ‘E’ or ‘I’ it is most likely to be misspelled by an English speaker as ‘SHE’ or ‘SHI’ while a Spanish speaker may hear ‘CHE’ or ‘CHI’ and sometimes ‘YE’ or ‘YI’.”
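
The 'SC' example can be sketched as a small rule table keyed by the listener's language. This is only an illustration of the idea of language-rule based substitution, not the company's algorithm: the table layout, the function name, and the test name SCIARRA are all assumptions, and only the substitutions themselves come from the example above.

```python
# Hypothetical rule table for the letter-pair "SC" before "E" or "I".
# Only the replacement values are taken from the quoted example; the
# structure and names are assumptions made for this sketch.
SC_RULES = {
    "english": ["SH"],
    "spanish": ["CH", "Y"],
}

def sc_misspellings(name, language):
    """Return likely misspellings of `name` for a listener of `language`."""
    name = name.upper()
    results = set()
    for i in range(len(name) - 2):
        # The rule applies only when "SC" is followed by "E" or "I".
        if name[i:i + 2] == "SC" and name[i + 2] in "EI":
            for sub in SC_RULES[language]:
                results.add(name[:i] + sub + name[i + 2:])
    return sorted(results)

print(sc_misspellings("SCIARRA", "english"))
print(sc_misspellings("SCIARRA", "spanish"))
```

A name such as SCOTT produces no candidates here because the 'SC' is not followed by 'E' or 'I', which is the kind of positional condition the quote describes.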

Company literature indicates the new fuzzy logic generation 2.0 technology has five layers:

1. Phonetic misspellings: such as GUALTIERREZ misspelled as GUALTIEREZ, AAGARD misspelled as OUGHGARD, and YOUNGMAN misspelled as YONGMAN.

2. Reversed letters: such as DIELEMAN misspelled as DEILEMAN and RODREGUEZ misspelled as RODREUGEZ. These algorithms look for errors due to reversed digraphs (two letter sequences that form one phoneme or distinct sound) which are a common typographical issue, such as “IE” substituted for “EI”.

3. Double letter misspellings: such as HUMBER misspelled as HUMBEER and ZWOLLE misspelled as ZWOLE. The most common typographical issues occur with the characters, in order of frequency, “SS”, “EE”, “TT”, “FF”, “LL”, “MM”, and “OO”.

4. Missed keystrokes: such as HUNTER misspelled as UNTER, missing the initial “H”, and TAMERON misspelled as TAMRON, missing the “E” in the middle.

5. Other typographical errors: which cover a variety of additional misspelling issues.
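
Layers 2 through 4 can be illustrated with simple candidate generators. The sketch below is an assumption about how such candidates might be produced, with hypothetical function names. It generates every possible candidate without weighting, whereas the product is described as selecting the most likely errors. The phonetic layer is omitted because it depends on the language rules.

```python
def reversed_letters(name):
    """Layer 2: swap each adjacent pair of different letters."""
    return {name[:i] + name[i + 1] + name[i] + name[i + 2:]
            for i in range(len(name) - 1) if name[i] != name[i + 1]}

def double_letter_errors(name):
    """Layer 3: double a single letter, or collapse a doubled letter."""
    out = set()
    for i, ch in enumerate(name):
        out.add(name[:i] + ch + name[i:])        # single letter typed twice
        if i + 1 < len(name) and name[i + 1] == ch:
            out.add(name[:i] + name[i + 1:])     # doubled letter typed once
    return out

def missed_keystrokes(name):
    """Layer 4: drop one letter at each position."""
    return {name[:i] + name[i + 1:] for i in range(len(name))}

# Each misspelling listed above appears among the generated candidates.
print("DEILEMAN" in reversed_letters("DIELEMAN"))
print("ZWOLE" in double_letter_errors("ZWOLLE"))
print("UNTER" in missed_keystrokes("HUNTER"))
```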

The pdSurname Pro software with the new fuzzy logic generation 2.0 technology is available for immediate download and can currently be purchased at a 25 percent introductory discount (sale, $371.25; regular, $495) or as part of bundles also on sale, pdSuite Names (sale, $645; regular, $795) and pdSuite Master Collection (sale, $795; regular, $995).

For users of other Peacock Data name software, Barbara Adair noted, “pdNickname Pro and pdGender Pro will be updated with fuzzy logic generation 2.0 capabilities this fall, and the upgrades will be free for anyone owning the older version.”

MORE ABOUT FUZZY LOGIC >>

MORE ABOUT PDSURNAME PRO >>

What is fuzzy logic?

Both pdNickname 2.x and pdGender 2.x are fully compatible with fuzzy logic. In these products, fuzzy logic involves slight variations in first names and nicknames based on common typographical errors and stylized spelling methods. The Pro edition of these packages comes equipped with fuzzy logic out of the box. Fuzzy logic add-ons can be appended to both the Pro and Standard versions.

The following illustrates the fuzzy logic technology employed in pdNickname 2.x and pdGender 2.x. Further information specific for these packages can be reviewed in the product user documentation found on our support page.

Typographical errors

A large majority of fuzzy logic records involve common typographical errors. These algorithms look at frequently reversed digraphs (a pair of letters used to make one phoneme or distinct sound), phonetically transcribed digraphs, double letters typed as single letters, single letters that are doubled, and other common data entry issues. The most likely typographical errors are determined based on the number of letters, the characters involved, where they are located in the name, and other factors.

The following are examples of fuzzy logic based on common typographical errors:

Example 1 | Real: AL | Fuzzy: ALL | the “L” is repeated
Example 2 | Real: ROCCO | Fuzzy: ROCO | the second “C” is left out
Example 3 | Real: CHRISTOPHER | Fuzzy: CHRISTOFER | the “PH” digraph is phonetically transcribed as “F”
Example 4 | Real: SOPHIA | Fuzzy: SOHPIA | the “PH” digraph is reversed
Example 5 | Real: MARGARET | Fuzzy: MARGRAET | the second “AR” digraph is reversed

Stylized spellings

Other fuzzy logic records involve stylized spelling methods. These algorithms look at non-regular characters such as extended ANSI characters (character codes 128 to 255) as well as hyphens, apostrophes, and spaces.

A few of the possible extended characters are “Á” (A-acute), “Ö” (O-umlaut), and “Ñ” (N-tilde). In these cases, “Á” becomes “A” (A-regular), “Ö” becomes “O” (O-regular), “Ñ” becomes “N” (N-regular), and other extended characters are treated similarly.

The following are examples of fuzzy logic based on stylized spellings:

Example 6 | Real: BJÖRK | Fuzzy: BJORK | spelled with O-regular instead of O-umlaut
Example 7 | Real: NICOLÁS | Fuzzy: NICOLAS | spelled with A-regular instead of A-acute
Example 8 | Real: ‘ASHTORET | Fuzzy: ASHTORET | spelled without an apostrophe prefix
Example 9 | Real: ABD-AL-HAMID | Fuzzy: ABDALHAMID | spelled without hyphens delimiting the name parts
Example 10 | Real: JUAN MARÍA | Fuzzy: JUANMARIA | spelled without the space between the two parts and with I-regular instead of I-acute
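
The stylized spellings in Examples 6 through 10 can be reproduced with a short normalization routine. This is a minimal sketch using Python's standard unicodedata module, not the product's own method, and the function name is hypothetical. It assumes the accented letters decompose under Unicode normalization. Letters that do not decompose (for example "Ø") would be dropped rather than converted.

```python
import unicodedata

def plain_spelling(name):
    """Reduce a stylized name to the unaccented letters A-Z only."""
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name.upper())
    # Keeping only A-Z drops the marks, hyphens, apostrophes, and spaces.
    return "".join(ch for ch in decomposed if "A" <= ch <= "Z")

for real in ["BJÖRK", "NICOLÁS", "‘ASHTORET", "ABD-AL-HAMID", "JUAN MARÍA"]:
    print(real, "->", plain_spelling(real))
```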

Fuzzy logic add-on packs and upgrades

Peacock Data releases additional fuzzy logic records nearly every month for pdNickname 2.x and pdGender 2.x in the form of add-on packs, which can easily and economically be appended to the main databases, extending coverage of typographical errors and stylized spelling methods.

The fuzzy logic technology built into the main Pro product downloads is designed to pick up the statistically most likely mistakes and stylizations. Fuzzy Logic Add-on Packs are designed to pick up less common mistakes and stylizations.

Add-on packs include new algorithms and randomizers and are fully compatible with both the Pro and Standard editions of these packages.

Those licensing the Standard edition of either product can also purchase a Standard to Pro Upgrade Pack which includes all the fuzzy logic records from the Pro edition. Once a Standard version is upgraded, it will be the same as the Pro edition.

Review the documentation provided with the fuzzy logic add-on packs and upgrades for further instructions.

Anatomy of a database, part 2

The first part of this column, Anatomy of a database, part 1, discussed the first four years of research and development for Peacock Data’s new name database products:

pdNickname 2.0 is an advanced name and nickname file used by businesses and organizations to merge database records.

pdGender 2.0 is a gender coding database built on the same set of names. Users can match the data against the first names on their lists to establish male and female identification.

Both upgrades embrace a host of similar and compatible features including languages of origin and use for each name as well as fuzzy logic so information can be recognized even when lists have typographical errors or uncommon spellings. They were built during the same development cycle because both are extracted from the same master file.

To recap, the main product research and development began in early 2009 and was completed by late 2012. Then beta versions of the new products entered field testing in January 2013.

According to the company’s chief development coordinator Barbara Adair, “By 2013 early planning for version 3.0 of the products was already underway and included new fuzzy logic technology designed to work with typographical errors and uncommon spellings. Then development proceeded so well that in April 2013 the new technology was moved up to the version 2.0 cycle.”

Barbara pointed out, “The most complex fuzzy logic involves predicting likely misspellings or alterations. We look at numerous factors that may occur in the spelling of a name. Common examples are frequently reversed digraphs (a pair of letters used to make one phoneme or distinct sound), phonetic transcriptions, double letters typed as single letters, non-common characters, the number of letters in a name, where elements occur in a name, and hundreds of other possible factors.”

“A lot of research and field trials have gone into creating the fuzzy logic algorithms and their inclusion in our new products will substantially increase their power for users,” she added.

“The difference between a real name and a fuzzy version can be very slight and even difficult to notice at first glance,” Barbara said. “But they are different and can make a big difference in the success rate for businesses and organizations working with lists of names.”

Barbara notes, “A sizable majority of the Pro edition of both new products is built with fuzzy logic, but users not ready to dive into the new technology can purchase a Standard edition without fuzzy logic and easily add it later when they are ready by contacting the company for an upgrade.”

As for the easiest part of development, Barbara quickly cited the special precision gender coding information in pdGender filtered for languages, rare usage of unisex names by one gender, and other criteria.

“By the time we had established the language information in the master file and flagged name types and rare unisex usages, it was actually quite easy to draw out the gender coding fields,” she said. “This is a testament to the quality of the information and how straightforward it is to work with.”

Barbara said, “The new products do have a learning curve but are ultimately very easy to exploit. It may take a few uses, but those working with the data will appreciate more and more how the information is organized and presented. A lot of thought and field testing has gone into this.”

One result of the decision to build pdNickname and pdGender from the same master file is the strong compatibility between the two offerings.

“While pdNickname and pdGender can easily be used separately, when used jointly they make excellent partners,” Barbara said. “They are comprised of the same set of names and can be linked together with little effort.”

On November 1, 2013, Peacock Data demonstrated the products in front of participants gathered in their Chatsworth, California offices. By this time the new releases were almost ready to go and the development team working under Barbara began tweaking the final layouts and authoring the product documentation.

pdNickname 2.0 Pro and pdGender 2.0 Pro were released on Monday, December 30, 2013 and the Standard editions (without fuzzy logic) made their debut two weeks later.

pdNickname 2.0 Pro has 3.9 million records, including 2.61 million with fuzzy logic, and is 2.9 GB counting all formats and files. pdNickname 2.0 Standard has 1.28 million records, does not have fuzzy logic, and is 964 MB.

pdGender 2.0 Pro has 140,000 records, including 80,000 with fuzzy logic, and is 80.6 MB. pdGender 2.0 Standard has 60,000 records, does not have fuzzy logic, and is 25.5 MB.

Product information

Anatomy of a database, part 1

According to Peacock Data, plans for two just released product upgrades were initially written up in January 2009 and development began in earnest mid-summer of that same year. The products were built during the same development cycle because both are extracted from the same master file.

One of the new products is pdNickname 2.0, an advanced name and nickname file used by businesses and organizations to merge database records. They can match the data against their lists to determine if two or more records are the same individual. It identifies first names that are the same even when they are not an exact match, but rather equivalent, such as a variation or nickname.

The other new package is pdGender 2.0, a gender coding database built on the same set of names. Users can match the data against the first names on their lists to establish male and female identification.

Both upgrades embrace a host of similar and compatible features including languages of origin and use for each name as well as fuzzy logic so information can be recognized even when lists have typographical errors or uncommon spellings.

According to the company’s chief development coordinator Barbara Adair, “Creation of the master name file these new products result from is the biggest venture our company has ever undertaken. There are thousands of sources for names in scores of languages, and our task was to compare and contrast all this data and create the ultimate first name resource.”

Information drawn from the sources includes variant spellings, relationships with other names, and the languages and gender associated with each name.

Barbara pointed out, “The language features have never been available before on this scale and required a sizable portion of the nearly five years of research and development.”

“From the start it was essential to identify the languages associated with each name in considerable detail,” she added. “This gives users previously unavailable ethnic demographics linked to the names already on their lists.”

Barbara showed some of the documents used in construction of the new offerings including a manuscript from 731 AD, written by a monk named Bede, listing the earliest English names dating from the Anglo-Saxon era of the Early Middle Ages. The still common personal name “Hilda” is an example from the manuscript.

“Because sources often give diverse information and use different spelling conventions, it was crucial not only to gather all the information possible but also to differentiate between the quality of sources,” Barbara explained. “Better information became easier to identify after working with the sources over the course of the first year.”

About half the database records are English and Spanish names, and international names originating and used in over 200 other languages make up the second half. This includes such languages as French, German, Chinese, Japanese, Vietnamese, Korean, Hindustani, Arabic, Persian, and Yiddish as well as Native American names and ancient Greek, Latin, and Hebrew names.

According to Barbara, “Special attention is paid to rare usages of unisex names like Kimberly, Hillary, Valentine, and even Maria. Names like these, while usually associated with one gender, are also occasionally employed by both genders. The new products identify rare usages so they can be considered separately. pdGender in particular employs this technology out-of-the-box allowing users to ignore rare unisex usages when assigning gender.”

“Beyond just identifying the languages of use, we also classify name origins, such as Old English opposed to Middle English opposed to modern English,” Barbara noted. “This adds value for those researching personal names or the relationships between languages, such as in the fields of anthroponymy, onomatology, ethnology, and linguistics.”

According to the product documentation, both packages identify five basic first name types:

  • Base Names
  • Variations
  • Short Form Nicknames
  • Diminutives
  • Opposite Gender Forms

“Assigning a type identification to each name was a lengthy part of development, but it is significant because the added information permits more precise filtering and ultimately better results,” Barbara said. “Base names are characteristically the oldest because they are the original names all later formations can be traced back to. A lot of time was devoted to these. It is important they are identified as accurately as possible because the remainder of the database is dependent on them.”

Most of the main product development was completed by the end of 2012 and field testing of beta editions commenced in January 2013.

See Anatomy of a database, part 2 for the rest of the story.

Product information

Using the pdNickname RELFLAG field

pdNickname is a unique database of nearly 50,000 records designed to facilitate comparing sets of first name data based on nicknames, diminutives, pet names, variations, and given names. One of the most important fields in the database product is RELFLAG, which stands for “Relationship Flag”.

The RELFLAG field contains one of two possible values:

1 = Close relationship between the name and variation (common variants): Includes closely associated nicknames, diminutives and pet names as well as first name variations that are considered closely related.

2 = More distant relationship between the name and variation (less common variants): Includes alternate forms of the names, often deriving from another culture, as well as nicknames, diminutives and pet names that are relatively uncommon.

pdNickname variations for the given name “SAMUAL”
The RELFLAG field indicates if the name and variation have a (1) close or (2) more distant relationship.

The RELFLAG field is useful for controlling what is to be considered an acceptable match. As more distant relationships are included in matches, the error rate naturally rises. The error rate increase is usually not substantial, but it is measurable in hundredths and tenths of a percent.

RECOMMENDATIONS

RESIDENTIAL: While additional accuracy can be achieved if only close relationships are considered, with residential lists the increase in error rate is almost always very small even when the more distant relationships are included, rarely more than 0.02% in our testing. Therefore, under best practices, it is fully acceptable to use all RELFLAG relationships when matching residential lists. With the exception of the George Foreman family, most errors that might occur result from different given names that share the same nickname or other variation.

BUSINESS AND ORGANIZATION LISTS: With business and organization lists, on the other hand, the increase in error rate when the more distant relationships are included is typically higher than with residential lists. Our testing normally shows an increase that is still less than 0.1%, but we have seen it as high as 0.3% with some large lists. Under best practices, it is recommended that only close relationships be considered when processing business and organization lists.
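
The recommendations above can be expressed as a filter on RELFLAG. The following is a minimal sketch under stated assumptions: the tuple layout, the function name, and the sample rows are hypothetical and are not taken from the actual database. Only the RELFLAG values 1 and 2 and the two list types come from the text.

```python
# Illustrative rows only (name, variation, RELFLAG); not actual database content.
ROWS = [
    ("WILLIAM", "BILL", 1),
    ("WILLIAM", "WILL", 1),
    ("WILLIAM", "GUILLERMO", 2),
]

def is_match(name_a, name_b, list_type, rows=ROWS):
    """True if name_b is an acceptable variation of name_a for this list type."""
    # Residential lists accept all relationships; other lists accept close ones only.
    allowed = {1, 2} if list_type == "residential" else {1}
    return any(n == name_a and v == name_b and flag in allowed
               for n, v, flag in rows)

print(is_match("WILLIAM", "GUILLERMO", "residential"))
print(is_match("WILLIAM", "GUILLERMO", "business"))
```

Keeping the accepted flags in one place makes it easy to tighten or loosen matching per list without touching the lookup itself.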

Restructuring the pdNickname database

An alternative structure for pdNickname is to have one record per name with the variations in fields next to it. This tutorial explains how to do it.

Matching and merging names can be tricky. How do you relate William Smith with Bill Smith? The pdNickname database can be utilized to match names that are dissimilar because one has a given first name while another has a nickname or other variation.

Out of the box, pdNickname is structured for immediate compatibility with the greatest number of database systems and to be easy to become familiar with.

The nickname database is set up with two names per record. The first name field contains the names you are looking up, and the second contains a variation for each name (nickname, diminutive, given name, variant, etc.). The same name can be listed several times in the first field, each time with a different variation. (See Figure 1.)

FIGURE 1: PDNICKNAME OUT OF THE BOX

If the names compared are Alexander Jones and Alex Jones, all names matching Alexander (NAME-A) are scanned until a variation is found that matches Alex (NAME-B). This works well, but there are other ways of organizing pdNickname that could work even better for you. In fact, we have restructured the table for utilization in our own services.

An alternative structure is to have one record per name and the variations in fields next to it. It is not practical to have separate fields for each variation, which can range from one to over two hundred. So what we do is have two Memo fields (also known as Long Text), one for close variations (relflag = "1") and the other for more distant variations (relflag = "2"), with the string of variations separated by delimiters for easier matching. (See Figure 2.)

FIGURE 2: PDNICKNAME RESTRUCTURED

Note: when browsing a table, normally you cannot see the content of a Memo or Long Text field because the database keeps it in a separate file. For this screenshot we have made the content visible.

Structured this way, when your program finds a match for NAME-A, it then determines if NAME-B can be found in variation field one or variation field two. This can be faster because you only access one record in each search request.
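
The restructuring described above can be sketched in a few lines. This is one possible approach, not the code used in our services: the function names, the "|" delimiter, the dictionary layout, and the sample rows are all assumptions made for illustration. The input mirrors the out-of-the-box layout of one name, one variation, and a relflag of "1" or "2" per record.

```python
from collections import defaultdict

def restructure(rows, delimiter="|"):
    """Collapse (name, variation, relflag) pairs into one record per name."""
    grouped = defaultdict(lambda: {1: [], 2: []})
    for name, variation, relflag in rows:
        grouped[name][int(relflag)].append(variation)
    # Wrap each string in delimiters so whole names can be found by substring.
    return {
        name: {
            "close": delimiter + delimiter.join(v[1]) + delimiter,
            "distant": delimiter + delimiter.join(v[2]) + delimiter,
        }
        for name, v in grouped.items()
    }

def find(table, name_a, name_b, delimiter="|"):
    """Return 'close', 'distant', or None for name_b as a variation of name_a."""
    record = table.get(name_a)
    if record is None:
        return None
    needle = delimiter + name_b + delimiter
    for field in ("close", "distant"):
        if needle in record[field]:
            return field
    return None

# Illustrative rows only; not actual database content.
rows = [
    ("ALEXANDER", "ALEX", "1"),
    ("ALEXANDER", "SANDY", "1"),
    ("ALEXANDER", "ALEJANDRO", "2"),
]
table = restructure(rows)
print(table["ALEXANDER"]["close"])
print(find(table, "ALEXANDER", "ALEX"))
```

Placing a delimiter at both ends of each string is what keeps a short name such as AL from falsely matching inside ALEX.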

pdNickname, like all our Database Products, is structured to satisfy most users from the start. But there are many ways to integrate the databases into your system. It is up to you to determine what works best for you. Do not be afraid to experiment.