Archive for Database Management

Fuzzy logic generation 2.0

Peacock Data introduced the next generation of its fuzzy logic technology last month with the release of pdSurname Pro, the California-based firm’s last name matching software.

According to company spokesperson Barbara Adair, “pdSurname facilitates identifying last names that are true variations or phonetically similar, while the fuzzy logic technology in the enhanced Pro edition allows finding names even when there are misspellings or other typographical errors.”

“We introduced fuzzy logic with our pdNickname Pro and pdGender Pro software in late 2013, but the new fuzzy logic generation 2.0 is a great enhancement,” Barbara Adair exclaimed.

According to the company, most of the enhancements were achieved after they developed a giant library of more than 80,000 language rules based on hundreds of dialects from around the world. Barbara Adair said, “Many misspellings occur as transcribers enter the sounds they hear. The character sequences and the sounds they produce are different for each language and situation, such as before, after, or between certain vowels and consonants, so our substitutions are language-rule based.”

The company explained additionally that their algorithms go even further by considering both how a name may sound to someone who speaks English and how it may sound to someone who speaks Spanish, which is often different. Barbara Adair explained, “Take the letter-pair ‘SC’ as an example. Before the vowels ‘E’ or ‘I’ it is most likely to be misspelled by an English speaker as ‘SHE’ or ‘SHI’ while a Spanish speaker may hear ‘CHE’ or ‘CHI’ and sometimes ‘YE’ or ‘YI’.”

Company literature indicates the new fuzzy logic generation 2.0 technology has five layers:

1. Phonetic misspellings: such as GUALTIERREZ misspelled as GUALTIEREZ, AAGARD misspelled as OUGHGARD, and YOUNGMAN misspelled as YONGMAN.

2. Reversed letters: such as DIELEMAN misspelled as DEILEMAN and RODREGUEZ misspelled as RODREUGEZ. These algorithms look for errors due to reversed digraphs (two letter sequences that form one phoneme or distinct sound) which are a common typographical issue, such as “IE” substituted for “EI”.

3. Double letter misspellings: such as HUMBER misspelled as HUMBEER and ZWOLLE misspelled as ZWOLE. The most common typographical issues occur with the characters, in order of frequency, “SS”, “EE”, “TT”, “FF”, “LL”, “MM”, and “OO”.

4. Missed keystrokes: such as HUNTER misspelled as UNTER, missing the initial “H”, and TAMERON misspelled as TAMRON, missing the “E” in the middle.

5. Other typographical errors: which cover a variety of additional misspelling issues.
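To make the layers concrete, here is a minimal Python sketch of the kinds of transformations described above. It is only an illustration: the digraph list and the variant-generating functions are simplified assumptions of ours, not Peacock Data’s actual 80,000-rule library or its algorithms.

```python
# Illustrative sketch of the fuzzy variant types described above.
# The digraph list is an invented example, not the vendor's rule library.
DIGRAPHS = ["IE", "EI", "AR", "PH", "SC"]  # hypothetical digraph list

def reversed_digraph_variants(name: str) -> set[str]:
    """Layer 2: swap the two letters of each digraph found in the name."""
    variants = set()
    for dg in DIGRAPHS:
        idx = name.find(dg)
        if idx != -1:
            variants.add(name[:idx] + dg[::-1] + name[idx + 2:])
    return variants

def double_letter_variants(name: str) -> set[str]:
    """Layer 3: collapse doubled letters and double single letters."""
    variants = set()
    for i in range(len(name) - 1):
        if name[i] == name[i + 1]:
            variants.add(name[:i] + name[i + 1:])            # ZWOLLE -> ZWOLE
        variants.add(name[:i + 1] + name[i] + name[i + 1:])  # HUMBER -> HUMBEER
    return variants

def missed_keystroke_variants(name: str) -> set[str]:
    """Layer 4: drop one character at a time."""
    return {name[:i] + name[i + 1:] for i in range(len(name))}

if __name__ == "__main__":
    print(sorted(reversed_digraph_variants("DIELEMAN")))  # includes DEILEMAN
    print(sorted(double_letter_variants("ZWOLLE")))       # includes ZWOLE
    print(sorted(missed_keystroke_variants("HUNTER")))    # includes UNTER
```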

The pdSurname Pro software with the new fuzzy logic generation 2.0 technology is available for immediate download and can currently be purchased at a 25 percent introductory discount (sale, $371.25; regular, $495) or as part of bundles also on sale, pdSuite Names (sale, $645; regular $795) and pdSuite Master Collection (sale, $795; regular, $995).

For users of other Peacock Data name software, Barbara Adair noted, “pdNickname Pro and pdGender Pro will be updated with fuzzy logic generation 2.0 capabilities this fall, and the upgrades will be free for anyone owning the older version.”

Making a data dictionary

A data dictionary is a document that catalogs the organization, contents, and conventions of a database or collection of databases. It lists all the databases, tables, views, fields, and data definitions, and often includes information about table layouts, the relationships between tables, and other details of the database schema.

It does not contain the actual data from the database system, only the information necessary to manage and utilize it. It is also not an instruction manual, though a data dictionary is often included as part of an instruction manual.

There is no universal standard as to the level of detail in a data dictionary. What is included is dependent on the audience and the complexity of the database infrastructure. System administrators and programmers will usually have a highly detailed document, sometimes complete with visual depictions, while end users may only have the basics.

Below is an example of a data dictionary for a bookkeeping database with three tables. It shows the kinds of information typically included in a data dictionary; however, it is not meant to be all-inclusive. Other columns might show whether a field accepts null values or the precise points where each field begins and ends. If scientific or technical information is involved, a column indicating normative ranges may be useful. The possibilities are myriad.

A data dictionary is an important part of database system documentation. Devoting the resources needed for a quality document will help ensure fewer problems and significantly aid productivity.

EXAMPLE DATA DICTIONARY FOR A BOOKKEEPING DATABASE

Number of Tables: 3

Table: name of the table. Field: name of the field. Rel: table relationship key (if any); PK = primary key, FK = foreign key; see Foreign Key Relationships. Type: field data type. Width: field width. Dec: number of decimal places (if any). Description: data definition of the field contents.

Foreign Key Relationships: (1) points to Customers table Id field. (2) points to Sales table Invoice field.

Table     | Field      | Rel    | Type      | Width | Dec | Description
CUSTOMERS | ID         | PK     | Character | 10    |     | Customer ID number
CUSTOMERS | NAME       |        | Character | 25    |     | Customer name
CUSTOMERS | CUST_TYPE  |        | Character | 1     |     | Customer type (key): A = Active, I = Inactive, P = Prospect
CUSTOMERS | TERMS      |        | Character | 1     |     | Payment terms (key): N = Net Due, P = Prepaid
SALES     | INVOICE    | PK     | Character | 4     |     | Invoice number
SALES     | CUST_ID    | FK (1) | Character | 10    |     | Customer ID number
SALES     | SAL_DATE   |        | Date      | 8     |     | Date of sale
SALES     | SAL_AMOUNT |        | Numeric   | 10    | 2   | Amount of sale
RECEIPTS  | ID         | PK     | Character | 10    |     | Unique ID number
RECEIPTS  | INV_NUM    | FK (2) | Character | 4     |     | Invoice number
RECEIPTS  | REC_DATE   |        | Date      | 8     |     | Date of receipt
RECEIPTS  | REC_AMOUNT |        | Numeric   | 10    | 2   | Amount of receipt
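A data dictionary can live in a word processor, a spreadsheet, or even in code. As a rough illustration of the latter, the following Python sketch stores a few rows of the example dictionary above as plain records and prints them in the same column order; the FieldDef structure and helper are hypothetical, not part of any particular product.

```python
# A minimal, hypothetical data dictionary kept as plain Python records.
# Only a few rows from the bookkeeping example are shown.
from dataclasses import dataclass

@dataclass
class FieldDef:
    table: str
    field: str
    rel: str          # "PK", "FK (1)", or "" for none
    dtype: str
    width: int
    dec: int          # decimal places, 0 if not applicable
    description: str

DATA_DICTIONARY = [
    FieldDef("CUSTOMERS", "ID",         "PK",     "Character", 10, 0, "Customer ID number"),
    FieldDef("SALES",     "INVOICE",    "PK",     "Character",  4, 0, "Invoice number"),
    FieldDef("SALES",     "CUST_ID",    "FK (1)", "Character", 10, 0, "Customer ID number"),
    FieldDef("SALES",     "SAL_AMOUNT", "",       "Numeric",   10, 2, "Amount of sale"),
]

def print_dictionary(entries):
    """Render the dictionary in the same column order used in the table above."""
    print(f"{'Table':<10}{'Field':<12}{'Rel':<8}{'Type':<11}{'Width':<7}{'Dec':<5}Description")
    for e in entries:
        print(f"{e.table:<10}{e.field:<12}{e.rel:<8}{e.dtype:<11}{e.width:<7}{e.dec:<5}{e.description}")

print_dictionary(DATA_DICTIONARY)
```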

Happy birthday Dr. Codd

Relational database theory took shape in the 1960s and 1970s, and most of the thinking and enthusiasm behind it came from Dr. Edgar Frank “Ted” Codd, who developed it while working at IBM’s San Jose research laboratory, the forerunner of today’s Almaden Research Center, in a then nascent Silicon Valley.

Dr. Codd, born August 19, 1923 on the Isle of Portland in England, studied mathematics and chemistry at Exeter College, Oxford, before serving as a pilot in the Royal Air Force during World War II. He moved to New York in 1948 to work for IBM as a mathematical programmer, but five years later relocated to Ottawa, Canada, in response to the rhetoric of Senator Joseph McCarthy. Not long after, he returned to the United States, and in 1965 he received a PhD in computer science from the University of Michigan, Ann Arbor. Two years later he again began working for IBM, this time at their research laboratory in San Jose, California, where he soon revolutionized database software by advocating a new relational model for database management.

Based on set theory (a branch of mathematical logic), Dr. Codd’s relational database model was developed at a time when most database platforms employed hierarchical or network systems, such as the CODASYL database approach, in which the structure of the data had to be defined within each application program. Dr. Codd’s relational model used a new query language (eventually becoming SEQUEL and later SQL) to access any combination of data stored in cross-referenced tables.

Following an internal IBM paper a year earlier, Dr. Codd outlined his concept in “A Relational Model of Data for Large Shared Data Banks,” published in 1970. However, due to vested interests in IBM’s then current hierarchical database approaches, such as IMS/DB, Dr. Codd’s ideas were not adopted until commercial rivals began implementing them. IBM finally released its first commercial relational database management system, SQL/DS (Structured Query Language/Data System), in 1981, and in 1983 released DB2, also SQL based, for the MVS operating system. The two products have coexisted since then, but SQL/DS was rebranded as DB2 for VM and VSE in the late 1990s.

The relational database model has since made its way into countless successful products, including Microsoft SQL Server, Microsoft Access, Microsoft FoxPro and Visual FoxPro, dataBased Intelligence dBase, Alaska Software XBase++, Apollo Database Engine, Apycom Software DBFView, Astersoft DBF Manager, Digital Equipment Rdb (now Oracle Rdb), DS-Datasoft Visual DBU, Elsoft DBF Commander, GrafX Software Clipper and Vulcan.NET, Informix (now IBM Informix), Multisoft FlagShip, Oracle Database, Recital Software Recital, Relational Technology Ingres, Software Perspectives Cule.Net, Sybase, and xHarbour.com xHarbour, to name a few.

Dr. Codd did not become wealthy from his ideas, but he received many accolades and will long be remembered. In 2004 SIGMOD (the Association for Computing Machinery’s Special Interest Group on Management of Data), which specializes in large-scale data management problems and databases, renamed its highest prize the SIGMOD Edgar F. Codd Innovations Award in his honor.

The inventor of the relational database system died on April 18, 2003 of heart failure at his home in Williams Island, Florida, at the age of 79.

What is fuzzy logic?

Both pdNickname 2.x and pdGender 2.x are fully compatible with fuzzy logic. In these products, fuzzy logic involves slight variations in first names and nicknames based on common typographical errors and stylized spelling methods. The Pro edition of these packages comes equipped with fuzzy logic out of the box. Fuzzy logic add-ons can be appended to both the Pro and Standard versions.

The following illustrates the fuzzy logic technology employed in pdNickname 2.x and pdGender 2.x. Further information specific to these packages can be reviewed in the product user documentation found on our support page.

Typographical errors

A large majority of fuzzy logic records involve common typographical errors. These algorithms look at frequently reversed digraphs (a pair of letters used to make one phoneme or distinct sound), phonetically transcribed digraphs, double letters typed as single letters, single letters that are doubled, and other common data entry issues. The most likely typographical errors are determined based on the number of letters, the characters involved, where they are located in the name, and other factors.

The following are examples of fuzzy logic based on common typographical errors:

Example 1 | Real: AL | Fuzzy: ALL | the “L” is repeated
Example 2 | Real: ROCCO | Fuzzy: ROCO | the second “C” is left out
Example 3 | Real: CHRISTOPHER | Fuzzy: CHRISTOFER | the “PH” digraph is phonetically transcribed as “F”
Example 4 | Real: SOPHIA | Fuzzy: SOHPIA | the “PH” digraph is reversed
Example 5 | Real: MARGARET | Fuzzy: MARGRAET | the second “AR” digraph is reversed
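In practice, fuzzy records such as Examples 1 through 5 are typically used as a lookup from each fuzzy spelling back to the real name it was generated from. The following Python sketch shows that idea using only the five example pairs above; the function name and fallback behavior are our own assumptions, not the product’s API.

```python
# Minimal sketch: using fuzzy records as a lookup back to the real name.
# The pairs below are just the five examples listed above.
FUZZY_TO_REAL = {
    "ALL": "AL",
    "ROCO": "ROCCO",
    "CHRISTOFER": "CHRISTOPHER",
    "SOHPIA": "SOPHIA",
    "MARGRAET": "MARGARET",
}

def resolve_name(entered: str) -> str:
    """Return the real name for an entered spelling, whether exact or fuzzy."""
    entered = entered.strip().upper()
    if entered in FUZZY_TO_REAL.values():
        return entered                            # already a real spelling
    return FUZZY_TO_REAL.get(entered, entered)    # fall back to the input as-is

print(resolve_name("Christofer"))   # -> CHRISTOPHER
print(resolve_name("Sophia"))       # -> SOPHIA
```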

Stylized spellings

Other fuzzy logic records involve stylized spelling methods. These algorithms look at non-regular characters such as extended ANSI characters (ASCII values 128 to 255) as well as hyphens, apostrophes, and spaces.

A few of the possible extended characters are “Á” (A-acute), “Ö” (O-umlaut), and “Ñ” (N-tilde). In these cases, “Á” becomes “A” (A-regular), “Ö” becomes “O” (O-regular), “Ñ” becomes “N” (N-regular), and other extended characters are treated similarly.

The following are examples of fuzzy logic based on stylized spellings:

Example 6 | Real: BJÖRK | Fuzzy: BJORK | spelled with O-regular instead of O-umlaut
Example 7 | Real: NICOLÁS | Fuzzy: NICOLAS | spelled with A-regular instead of A-acute
Example 8 | Real: ‘ASHTORET | Fuzzy: ASHTORET | spelled without an apostrophe prefix
Example 9 | Real: ABD-AL-HAMID | Fuzzy: ABDALHAMID | spelled without hyphens delimiting the name parts
Example 10 | Real: JUAN MARÍA | Fuzzy: JUANMARIA | spelled without the space between the two parts and with I-regular instead of I-acute
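The stylized-spelling cases above are essentially a normalization step: fold extended characters to their plain equivalents and drop hyphens, apostrophes, and spaces. Below is a minimal Python sketch of that step using only the standard library; it is our own illustration, not the product’s code.

```python
# Sketch: normalize stylized spellings as in Examples 6-10.
import unicodedata

def normalize_name(name: str) -> str:
    """Fold accented characters to plain letters and strip separators."""
    # NFKD decomposition splits characters like "Ö" into "O" plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only if ch not in "-' ").upper()

print(normalize_name("BJÖRK"))          # -> BJORK
print(normalize_name("ABD-AL-HAMID"))   # -> ABDALHAMID
print(normalize_name("JUAN MARÍA"))     # -> JUANMARIA
```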

Fuzzy logic add-on packs and upgrades

Peacock Data releases additional fuzzy logic records nearly every month for pdNickname 2.x and pdGender 2.x in the form of add-on packs, which can easily and economically be appended to the main databases, extending coverage of typographical errors and stylized spelling methods.

The fuzzy logic technology built into the main Pro product downloads is designed to pick up the statistically most likely mistakes and stylizations. Fuzzy Logic Add-on Packs are designed to pick up less common mistakes and stylizations.

Add-on packs include new algorithms and randomizers and are fully compatible with both the Pro and Standard editions of these packages.

Those licensing the Standard edition of either product can also purchase a Standard to Pro Upgrade Pack which includes all the fuzzy logic records from the Pro edition. Once a Standard version is upgraded, it will be the same as the Pro edition.

Review the documentation provided with the fuzzy logic add-on packs and upgrades for further instructions.

Anatomy of a database, part 2

The first part of this column, Anatomy of a database, part 1, discussed the first four years of research and development for Peacock Data’s new name database products:

pdNickname 2.0 is an advanced name and nickname file used by businesses and organizations to merge database records.

pdGender 2.0 is a gender coding database built on the same set of names. Users can match the data against the first names on their lists to establish male and female identification.

Both upgrades embrace a host of similar and compatible features including languages of origin and use for each name as well as fuzzy logic so information can be recognized even when lists have typographical errors or uncommon spellings. They were built during the same development cycle because both are extracted from the same master file.

To recap, the main product research and development began in early 2009 and was completed by late 2012. Then beta versions of the new products entered field testing in January 2013.

According to the company’s chief development coordinator Barbara Adair, “By 2013 early planning for version 3.0 of the products was already underway and included new fuzzy logic technology designed to work with typographical errors and uncommon spellings. Then development proceeded so well that in April 2013 the new technology was moved up to the version 2.0 cycle.”

Barbara pointed out, “The most complex fuzzy logic involves predicting likely misspellings or alterations. We look at numerous factors that may occur in the spelling of a name. Common examples are frequently reversed digraphs (a pair of letters used to make one phoneme or distinct sound), phonetic transcriptions, double letters typed as single letters, non-common characters, the number of letters in a name, where elements occur in a name, and hundreds of other possible factors.”

“A lot of research and field trials have gone into creating the fuzzy logic algorithms and their inclusion in our new products will substantially increase their power for users,” she added.

“The difference between a real name and a fuzzy version can be very slight and even difficult to notice at first glance,” Barbara said. “But they are different and can make a big difference in the success rate for businesses and organizations working with lists of names.”

Barbara noted, “A sizable majority of the Pro edition of both new products is built with fuzzy logic, but users not ready to dive into the new technology can purchase a Standard edition without fuzzy logic and easily add it later when they are ready by contacting the company for an upgrade.”

As for the easiest part of development, Barbara quickly cited the special precision gender coding information in pdGender filtered for languages, rare usage of unisex names by one gender, and other criteria.

“By the time we had established the language information in the master file and flagged name types and rare unisex usages, it was actually quite easy to draw out the gender coding fields,” she said. “This is a testament to the quality of the information and how straightforward it is to work with.”

Barbara said, “The new products do have a learning curve but are ultimately very easy to exploit. It may take a few uses, but those working with the data will appreciate more and more how the information is organized and presented. A lot of thought and field testing has gone into this.”

One result of the decision to build pdNickname and pdGender from the same master file is the strong compatibility between the two offerings.

“While pdNickname and pdGender can easily be used separately, when used jointly they make excellent partners,” Barbara said. “They are comprised of the same set of names and can be linked together with little effort.”

On November 1, 2013, Peacock Data demonstrated the products in front of participants gathered in their Chatsworth, California offices. By this time the new releases were almost ready to go and the development team working under Barbara began tweaking the final layouts and authoring the product documentation.

pdNickname 2.0 Pro and pdGender 2.0 Pro were released on Monday, December 30, 2013 and the Standard editions (without fuzzy logic) made their debut two weeks later.

pdNickname 2.0 Pro has 3.9 million records, including 2.61 million with fuzzy logic, and is 2.9 GB counting all formats and files. pdNickname 2.0 Standard has 1.28 million records, does not have fuzzy logic, and is 964 MB.

pdGender 2.0 Pro has 140,000 records, including 80,000 with fuzzy logic, and is 80.6 MB. pdGender 2.0 Standard has 60,000 records, does not have fuzzy logic, and is 25.5 MB.

Anatomy of a database, part 1

According to Peacock Data, plans for two just released product upgrades were initially written up in January 2009 and development began in earnest mid-summer of that same year. The products were built during the same development cycle because both are extracted from the same master file.

One of the new products is pdNickname 2.0, an advanced name and nickname file used by businesses and organizations to merge database records. They can match the data against their lists to determine if two or more records are the same individual. It identifies first names that are the same even when they are not an exact match, but rather equivalent, such as a variation or nickname.

The other new package is pdGender 2.0, a gender coding database built on the same set of names. Users can match the data against the first names on their lists to establish male and female identification.

Both upgrades embrace a host of similar and compatible features including languages of origin and use for each name as well as fuzzy logic so information can be recognized even when lists have typographical errors or uncommon spellings.

According to the company’s chief development coordinator Barbara Adair, “Creation of the master name file these new products result from is the biggest venture our company has ever undertaken. There are thousands of sources for names in scores of languages, and our task was to compare and contrast all this data and create the ultimate first name resource.”

Information drawn from the sources includes variant spellings, relationships with other names, and the languages and gender associated with each name.

Barbara pointed out, “The language features have never been available before on this scale and required a sizable portion of the nearly five years of research and development.”

“From the start it was essential to identify the languages associated with each name in considerable detail,” she added. “This gives users previously unavailable ethnic demographics linked to the names already on their lists.”

Barbara showed some of the documents used in construction of the new offerings, including a manuscript from 731 AD, written by a monk named Bede, listing the earliest English names dating from the Anglo-Saxon era of the Early Middle Ages. The still common personal name “Hilda” is an example from the manuscript.

“Because sources often give diverse information and use different spelling conventions, it was crucial not only to gather all the information possible but also to differentiate between the quality of sources,” Barbara explained. “Better information became easier to identify after working with the sources over the course of the first year.”

About half the database records are English and Spanish names, and international names originating and used in over 200 other languages make up the second half. This includes such languages as French, German, Chinese, Japanese, Vietnamese, Korean, Hindustani, Arabic, Persian, and Yiddish as well as Native American names and ancient Greek, Latin, and Hebrew names.

According to Barbara, “Special attention is paid to rare usages of unisex names like Kimberly, Hillary, Valentine, and even Maria. Names like these, while usually associated with one gender, are also occasionally employed by both genders. The new products identify rare usages so they can be considered separately. pdGender in particular employs this technology out-of-the-box allowing users to ignore rare unisex usages when assigning gender.”

“Beyond just identifying the languages of use, we also classify name origins, such as Old English as opposed to Middle English or modern English,” Barbara noted. “This adds value for those researching personal names or the relationships between languages, such as in the fields of anthroponymy, onomatology, ethnology, and linguistics.”

According to the product documentation, both packages identify five basic first name types:

  • Base Names
  • Variations
  • Short Form Nicknames
  • Diminutives
  • Opposite Gender Forms

“Assigning a type identification to each name was a lengthy part of development, but it is significant because the added information permits more precise filtering and ultimately better results,” Barbara said. “Base names are characteristically the oldest because they are the original names all later formations can be traced back to. A lot of time was devoted to these. It is important they are identified as accurately as possible because the remainder of the database is dependent on them.”

Most of the main product development was completed by the end of 2012 and field testing of beta editions commenced in January 2013.

See Anatomy of a database, part 2 for the rest of the story.

A brief history of encryption

Encryption, part of the science of cryptography, is the process of transforming information using an algorithm (called a cipher) to make it unreadable to anyone except those possessing a special key to unlock the information. The unlocking process reverses the encryption and is known as decryption. According to a 2007 report by the Computer Security Institute, 71% of the companies surveyed utilized encryption for some of their data in transit, while 53% utilized encryption for some of their data in storage.

Encryption, however, is not an idea that was born in the computer age. It has an extensive past and has long been used by militaries, diplomats, and governments to facilitate secret communications. Here is a brief history:

Writing

Early encryption was solely concerned with converting messages into unreadable groups of figures to protect the content during the time the message was being carried from one place to another or otherwise in public view. The first approach to cryptography was the simple writing of a message. It worked because most people could not read.

As early as four thousand years ago, Egyptian scribes decorated the tombs of deceased rulers and kings with irregular hieroglyphs that some experts believe were intended to hide the meaning of the words. Others contend they were designed to make the text appear more regal and important.

The ancient Chinese utilized the ideographic nature of their language to hide the meaning of words, but unlike most cryptographic methods, it was used for personal correspondence, not for military intelligence. Genghis Khan, for example, is not believed to have employed encryption.

Scytale

The first known encryption device, invented by the Spartans around 400 BC, was a cylinder-shaped baton or staff called a scytale. The coding technique consisted of wrapping a strip of leather or parchment around a scytale of a specific diameter. The sender wrote the message down the length of the staff, and then unwrapped the material, resulting in the order of the letters being scrambled. The recipient would then wrap the parchment back around a scytale of the same diameter, and the letters would be in the proper order again.
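In modern terms a scytale is a columnar transposition in which the staff’s circumference acts as the key. Here is a short Python sketch under that interpretation; the diameter value and padding character are arbitrary choices for the example.

```python
# Sketch of a scytale modeled as a columnar transposition.
# "diameter" = how many letters fit around the staff in one turn.
def scytale_encrypt(message: str, diameter: int) -> str:
    """Write row-wise into `diameter` columns, then read off column by column."""
    text = message.replace(" ", "").upper()
    while len(text) % diameter:      # pad so every column is full
        text += "X"
    return "".join(text[i::diameter] for i in range(diameter))

def scytale_decrypt(ciphertext: str, diameter: int) -> str:
    """Undo the transposition: the row count becomes the new column count."""
    rows = len(ciphertext) // diameter
    return "".join(ciphertext[i::rows] for i in range(rows))

ct = scytale_encrypt("HELP ME I AM UNDER ATTACK", 4)
print(ct)                        # scrambled strip
print(scytale_decrypt(ct, 4))    # -> HELPMEIAMUNDERATTACK
```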

Caesar cipher

Next came the Caesar cipher which uses a simple system, today known as a substitution cipher, where each letter is shifted a specific number of positions up or down the alphabet. The recipient would decrypt the message by knowing in advance how many positions were needed to reverse the substitution or, probably more often, by trial and error until the correct shift was determined. It is named after Julius Caesar (100 BC–44 BC) because he used it to protect correspondence of military significance, but other substitution ciphers are known to have been used earlier. As a historical note, Caesar used a shift of three, and the second most common shift in ancient times was four.

Another simple substitution cipher is the atbash for the Hebrew alphabet. It is mentioned in the Old Testament Book of Jeremiah and involves replacing the first letter of the alphabet with the last, the second letter with the second last, and continuing in that manner.
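Both ciphers are single mappings over the alphabet and are easy to express in code. The Python sketch below uses Caesar’s historical shift of three by default and, purely for illustration, applies the atbash rule to the Latin alphabet rather than the Hebrew alphabet it was devised for.

```python
# Sketch of two simple substitution ciphers: Caesar shift and atbash.
import string

ALPHABET = string.ascii_uppercase

def caesar(text: str, shift: int = 3) -> str:
    """Shift each letter `shift` positions; use a negative shift to decrypt."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + shift) % 26] if ch in ALPHABET else ch
        for ch in text.upper()
    )

def atbash(text: str) -> str:
    """Replace the first letter with the last, the second with the second last, and so on."""
    return "".join(
        ALPHABET[25 - ALPHABET.index(ch)] if ch in ALPHABET else ch
        for ch in text.upper()
    )

ct = caesar("ATTACK AT DAWN")        # -> DWWDFN DW GDZQ
print(ct, caesar(ct, -3))            # round-trips back to the plaintext
print(atbash("JEREMIAH"))            # -> QVIVNRZS
```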

More complex ciphers

Simple substitution methods of cryptography were the norm for about one thousand years, but during the Middle Ages they became ineffective as the practice became better known and the population more literate. Aided by advances in mathematics, they were replaced by various random substitution methods as well as an approach known as columnar transposition where messages are written in columns, and then the columns are rearranged.

Italian author and cryptographer Leon Battista Alberti (1404–1472), who has been called the “Father of Western Cryptography”, invented the polyalphabetic substitution method which uses multiple substitution alphabets. A form known as the Vigenère cipher, which employs a series of different Caesar ciphers based on the letters of a keyword, withstood three centuries of attempts to break it. It was still being used by the Confederates during the American Civil War, but their messages were regularly cracked by the Union.
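The Vigenère cipher can be stated compactly: each plaintext letter is Caesar-shifted by the value of the corresponding keyword letter. A minimal Python sketch follows; the sample keyword and message are invented for the example.

```python
# Sketch of the Vigenère cipher: keyword letters drive a rotating Caesar shift.
def vigenere(text: str, keyword: str, decrypt: bool = False) -> str:
    """Shift each letter by the matching keyword letter (A=0, B=1, ...)."""
    key = [ord(k) - ord("A") for k in keyword.upper()]
    out, i = [], 0
    for ch in text.upper():
        if "A" <= ch <= "Z":
            shift = -key[i % len(key)] if decrypt else key[i % len(key)]
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
            i += 1
        else:
            out.append(ch)   # leave spaces and punctuation untouched
    return "".join(out)

ct = vigenere("RICHMOND IS SECURE", "LEMON")
print(ct, vigenere(ct, "LEMON", decrypt=True))   # round-trips back
```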

A German cryptographer named Johannes Trithemius (1462–1516) wrote Polygraphiae, published in 1518, the first printed book on cryptology. He also invented a steganographic cipher in which each letter was represented by a word taken from a sequence of columns, with the resulting series of words making up a prayer.

Cipher wheel

During his term as George Washington’s secretary of state (1790–1793), Thomas Jefferson devised a machine he called a “cipher wheel” consisting of twenty-six cylindrical wooden pieces threaded onto an iron spindle. The letters of the alphabet were inscribed on the edge of each wheel in a random order and turning the wheel scrambled and unscrambled the message. The recipient would decrypt the message with another cipher wheel and trial and error until the correct combination was determined.

Jefferson learned the importance of coded messages while serving as the U.S. minister to France (1784–1789), where he found that European postmasters regularly opened and read the letters passing through their offices.

ENIGMA

The most famous encryption device is ENIGMA, an electro-mechanical rotor cipher machine invented by German engineer Arthur Scherbius at the end of World War I and used by Nazi Germany during World War II. It employs multiple substitution ciphers that rotate iteratively, thereby minimizing the danger of frequency analysis discovering the key. Allied cryptographers did, however, break the code during the war, which was a celebrated event for the Allies and affected military intelligence for the remainder of the war.

Modern times

Since the dawn of the computer era, more and more information has become accessible through the internet, both while at rest on servers and while in transmission from one network to another, advancing the needs and requirements of encryption by leaps and bounds. Modern versions now include such enhancements as message integrity checking, identity authentication, and digital signatures.

Today encryption is particularly important for protecting credit card numbers, Social Security numbers, passwords, personal identification numbers (PINs), client lists, business plans, and corporate intelligence. However, for many it is also important for concealing political dissent as well as opposition in the workplace.

Pretty Good Privacy

Today the best known encryption for business and private use is Pretty Good Privacy (PGP) developed by Philip R. “Phil” Zimmermann, Jr. Originally designed as a human rights tool, PGP was published for free on the internet in 1991. It is now owned by PGP Corporation (a part of Symantec).

There is no known method which will allow someone to break PGP encryption, and it has been described by computer security specialist Bruce Schneier as “the closest you’re likely to get to military-grade encryption.” Nonetheless, early versions of PGP were found to have theoretical vulnerabilities, so new editions have been released and development continues.

PGP and similar products follow the OpenPGP standard for encrypting and decrypting data. Defined by the OpenPGP Working Group of the Internet Engineering Task Force (IETF) in Proposed Standard RFC 4880, OpenPGP is today the most widely used email encryption standard in the world.

Advanced Encryption Standard

The most secure encryption at present is recognized as the Advanced Encryption Standard (AES), developed by Belgian cryptographers Joan Daemen and Vincent Rijmen. It is based on the Rijndael cipher, a design that combines multiple rounds of substitution and permutation transformations. The cipher is described as a symmetric-key algorithm in which the same key is used for both encryption and decryption. AES has been adopted by the U.S. government and is now used worldwide.
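For readers who want to experiment with AES, the sketch below uses the AES-GCM construction from the third-party Python cryptography package (an authenticated mode of AES). It is a general illustration of symmetric encryption and decryption with one key, not a complete or recommended security configuration.

```python
# Sketch: AES in GCM mode via the third-party `cryptography` package
# (pip install cryptography). Illustrative only.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)   # the same key encrypts and decrypts
nonce = os.urandom(12)                      # 96-bit nonce, must be unique per message
aesgcm = AESGCM(key)

ciphertext = aesgcm.encrypt(nonce, b"corporate data", b"optional header")
plaintext = aesgcm.decrypt(nonce, ciphertext, b"optional header")
assert plaintext == b"corporate data"
```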

Recent concerns

Security uncertainties involving encryption have been recent news items. The most public example occurred in May and June of 2013. Citing documents leaked by former National Security Agency (NSA) contract employee Edward Snowden, U.K. newspaper The Guardian reported that the NSA and its British counterpart, the GCHQ, have cracked ciphers that are widely used to protect online information, including email and other communications, banking and medical records, and corporate data.

For years prior, some have expressed alarm that intelligence and defense agencies throughout the world, including the NSA and the U.S. Department of Defense, spend billions of dollars to crack encryption codes.

Security experts nevertheless maintain that, if properly implemented and maintained, encryption is still the best practical way to secure data.

The internet is a wonderful resource that allows us to communicate instantly with anyone, virtually anywhere—but it also allows access to cyber criminals and electronic hackers who can steal millions of dollars or wreak havoc in seconds.

It is safest to keep sensitive information on removable storage media such as portable external hard drives and flash memory drives. However, these can be lost, and this is not a useful security solution in many situations.

In a great majority of cases the impact of unwanted disclosure of information is negligible—but not always. Do not discount encryption as too difficult, time consuming, or expensive. It is not, and in some cases it is required by law. There is a good chance that one day you will be glad you used it.

What are jam values?

In the world of statistical databases, jam values are common. They are hard-coded information displayed instead of derived measures. They are used to represent unique situations where either the information to be conveyed is an explanation for the absence of data, represented by a symbol in the data display, such as a dot “.”, or the information to be conveyed is an open-ended distribution, such as 1,000,000 or greater, represented by 1000001. Even an empty value or a zero (“0”) will often have a special meaning.

Jam values can also be used to explain why information cannot be disclosed, such as for privacy reasons or because the data does not meet certain filtering criteria.

Data definitions are provided with jam values and normally they are not difficult to utilize. Depending on the parameters of the project, some may even choose to ignore them altogether. However, it is important to understand that they exist, and are not treated like numbers in tabulations and analysis.

Special consideration must be taken when importing data with jam values because they are often in alpha form. Non-numeric values will be converted to zeros (0) if appended into fields that accept only numeric information.

Some users convert jam values to special numbers during the import process so numeric fields can be used. Numeric fields are easier to work with because they do not have to be converted when counted or used in calculations.
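As a rough illustration of that import step, the following Python sketch uses pandas to coerce a jam symbol such as “.” to a missing value and to flag the open-ended 1000001 jam value separately; the column names and sample rows are invented for the example.

```python
# Sketch: handling jam values on import with pandas (one possible approach).
import pandas as pd
from io import StringIO

raw = StringIO("tract,median_income\n0101,52000\n0102,.\n0103,1000001\n")
df = pd.read_csv(raw, dtype={"tract": str})   # keep leading zeros in the ID column

# Coerce jam symbols such as "." to NaN instead of letting them silently become 0.
df["median_income"] = pd.to_numeric(df["median_income"], errors="coerce")

# Flag the open-ended "1,000,000 or greater" jam value separately.
df["topcoded"] = df["median_income"] == 1000001

print(df)
```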

You will find jam values frequently employed in American Community Survey (ACS) estimates and margins of error as well as in our own ACS demographics product, pdACS2013.

CASS vs. DSF2

With respect to database management, one of the questions we are most often asked concerns the difference between CASS and DSF2 address/mail processing. These services have grown more similar in recent years, but there are still important differences.

CASS

CASS – Coding Accuracy Support System – standardizes addresses and improves the accuracy of carrier route, five-digit ZIP, ZIP+4 and delivery point codes that appear on mail pieces. When mailers CASS certify their lists, they can become eligible for bulk mail discounts from the U.S. Postal Service (USPS).

In the past, applying these codes only required that a mailing address fall within a valid address range. The address itself was not necessarily accurate. For example, ZIP+4 codes typically represent a range of 100 or fewer house numbers on a given street. While 100-199 Rollins Road may be assigned to ZIP+4 code 91311-7007, not all those house numbers are actual delivery points.

Since November 2, 2006, the USPS has determined that CASS-related discounts will only apply when the agency’s Delivery Point Validation (DPV) process confirms the primary number—or the first line—of the addresses. If the primary street number of an address cannot be confirmed with DPV, then no ZIP+4 code is assigned to the mail. Without a ZIP+4 code, the mail ceases to qualify for postal discounts. DPV is therefore now included with CASS processing.

CASS also integrates Locatable Address Conversion System service (LACSLink) which converts some PO Box and rural route addresses to city-style addresses and updates addresses that have been renumbered or renamed.

CASS processing software is graded by the USPS National Customer Support Center (NCSC), located in Memphis, Tennessee, by processing a test file. To be certified, a 98.5% accuracy rate is required.

DSF2

Prior to the November 2006 requirements added to CASS processing, DSF2 – Delivery Sequence File Second Generation – was the only fail-safe method of checking the accuracy of mailing addresses since CASS only determined if the address fell within a valid address range. But DSF2 continues to offer advantages that can make it a good choice.

In addition to all the processing and validation supplied with CASS, DSF2 also provides mailers with other important address attributes. It classifies an address as residential or business, identifies the delivery type—curb-side, door-slot, Neighborhood Delivery and Collection Box Unit (NDCBU) or central delivery—and flags seasonal and vacant addresses.

This information allows for more targeted mailings. For example, a company may want to send out a sample box of laundry detergent—with DSF2 they can target their mailing to residential addresses that do not have a door-slot mailbox since the sample would not fit through the slot.

DSF2 also appends codes that can be used to qualify for USPS Enhanced Carrier Route (ECR) Basic, High Density and Saturation postal discounts. These include Pseudo Delivery Sequence Numbers for walk sequence postal presort discounts and Enhanced Line of Travel codes (eLOT).

CASS and DSF2 have similarities and important differences. The choice is determined by the targeting required and the postal discounts that can be qualified for. In both cases, however, users will typically benefit from a significant reduction in undeliverable mail, speedier delivery and lower costs.

Regularly review database systems

Maintaining a successful database infrastructure requires regular review to establish what is going right and where problems may be lurking. This should consist of close consideration of all aspects of the database system, including hardware & software, administration & input, documentation, staff & training, and disaster recovery.

The frequency of database system reviews is dependent on the size of a company or organization and the complexity of the system. At a minimum, we recommend quarterly reviews, but many situations warrant more frequent action.

Prior to conducting any database system review it is important to establish a set of benchmarks and checklists to compare against the findings.

HARDWARE & SOFTWARE: A database system review should begin with an assessment of the applications, computers, workstations, network servers and other devices that underlie and run the system. Decide if they are meeting the expectations and requirements of end users and if they are doing so efficiently. Needs change and technologies grow rapidly, and keeping on top of the machinery is essential in a competitive environment.

ADMINISTRATION & INPUT: Take a long look at how the database system is administered and the input into the tables. Determine if there has been unauthorized augmentation of a database, particularly rogue changes, or if unapproved or non-standard practices and tools are utilized. This will help maintain the system’s physical and logical models as well as prevent costly downtime and gaps in performance.

DOCUMENTATION: The fun often stops for tech people when they have to hang up their programming tools and put their work down on paper. Consequently, documentation is often lacking for database systems. Make sure this is not the case, because the usability of the system depends on it. Documentation should include a data dictionary and reflect the current physical and logical state of the infrastructure as well as be understandable to the less tech savvy.

STAFF & TRAINING: The people part of the review is very important because a database system is only as good as those that run and use it. It is important to align duties properly, and the staff needs to have the necessary expertise and training to adequately leverage the technology and be equipped to handle new complexities in the infrastructure. Investment in this area can reap large rewards.

DISASTER RECOVERY: Last but definitely not least, assess the database system in terms of its ability to recover from a disaster. Backups need to be performed regularly and properly stored, and it is vital this includes offsite backup. Additionally, make sure there is an adequate plan for unforeseen complications and worst-case scenarios and that the system’s immunization against viruses, worms and other web-based attacks is at full strength. This is particularly important when there are substantial changes to the database infrastructure.

Database management can become overwhelming as requirements escalate and the volume of data mushrooms. Regular review of a database system is essential to preserve the return on investment, meet objectives and ensure long-term success.