
Anatomy of a database, part 2

The first part of this column, Anatomy of a database, part 1, discussed the first four years of research and development for Peacock Data’s new name database products:

pdNickname 2.0 is an advanced name and nickname file used by businesses and organizations to merge database records.

pdGender 2.0 is a gender coding database built on the same set of names. Users can match the data against the first names on their lists to establish male and female identification.

Both upgrades embrace a host of similar and compatible features including languages of origin and use for each name as well as fuzzy logic so information can be recognized even when lists have typographical errors or uncommon spellings. They were built during the same development cycle because both are extracted from the same master file.

To recap, the main product research and development began in early 2009 and was completed by late 2012. Then beta versions of the new products entered field testing in January 2013.

According to the company’s chief development coordinator Barbara Adair, “By 2013 early planning for version 3.0 of the products was already underway and included new fuzzy logic technology designed to work with typographical errors and uncommon spellings. Then development proceeded so well that in April 2013 the new technology was moved up to the version 2.0 cycle.”

Barbara pointed out, “The most complex fuzzy logic involves predicting likely misspellings or alterations. We look at numerous factors that may occur in the spelling of a name. Common examples are frequently reversed digraphs (a pair of letters used to make one phoneme or distinct sound), phonetic transcriptions, double letters typed as single letters, non-common characters, the number of letters in a name, where elements occur in a name, and hundreds of other possible factors.”

“A lot of research and field trials have gone into creating the fuzzy logic algorithms and their inclusion in our new products will substantially increase their power for users,” she added.

“The difference between a real name and a fuzzy version can be very slight and even difficult to notice at first glance,” Barbara said. “But they are different and can make a big difference in the success rate for businesses and organizations working with lists of names.”
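The variants Barbara describes are easy to picture in code. The sketch below is purely illustrative (it is not Peacock Data’s actual fuzzy logic), showing how two of the factors she mentions, reversed adjacent letters and double letters typed as single letters, can be turned into candidate spellings for matching:

    # Illustrative sketch only; not Peacock Data's fuzzy logic algorithm.
    # Generates candidate misspellings of a name from two of the factors
    # discussed above: reversed adjacent letters and double letters typed
    # as single letters.
    def candidate_variants(name):
        name = name.upper()
        variants = set()
        for i in range(len(name) - 1):
            # Reversed adjacent letters, e.g. "MICHAEL" -> "MICAHEL"
            variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])
            # Double letters typed as single letters, e.g. "ANNETTE" -> "ANETTE"
            if name[i] == name[i + 1]:
                variants.add(name[:i] + name[i + 1:])
        variants.discard(name)
        return sorted(variants)

    print(candidate_variants("ANNETTE"))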

Barbara notes, “A sizable majority of the Pro edition of both new products is built with fuzzy logic, but users not ready to dive into the new technology can purchase a Standard edition without fuzzy logic and easily add it later when they are ready by contacting the company for an upgrade.”

As for the easiest part of development, Barbara quickly cited the special precision gender coding information in pdGender, which is filtered for languages, rare usage of unisex names by one gender, and other criteria.

“By the time we had established the language information in the master file and flagged name types and rare unisex usages, it was actually quite easy to draw out the gender coding fields,” she said. “This is a testament to the quality of the information and how straightforward it is to work with.”

Barbara said, “The new products do have a learning curve but are ultimately very easy to exploit. It may take a few uses, but those working with the data will appreciate more and more how the information is organized and presented. A lot of thought and field testing has gone into this.”

One result of the decision to build pdNickname and pdGender from the same master file is the strong compatibility between the two offerings.

“While pdNickname and pdGender can easily be used separately, when used jointly they make excellent partners,” Barbara said. “They are built from the same set of names and can be linked together with little effort.”
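As a rough sketch of what linking the two products might look like in practice (the file names and column names below are hypothetical, not the actual product layouts), the tables could be joined on their shared name key using, for example, the pandas library:

    # Hypothetical sketch: joining the two name tables on a shared name key.
    # File names and column names are illustrative, not the product layouts.
    import pandas as pd

    nicknames = pd.read_csv("pdnickname.csv")   # e.g. columns: NAME, VARIATION, RELFLAG
    genders = pd.read_csv("pdgender.csv")       # e.g. columns: NAME, GENDER

    # Because both products are built from the same set of names,
    # a simple merge on the name column links them together.
    combined = nicknames.merge(genders, on="NAME", how="left")
    print(combined.head())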

On November 1, 2013, Peacock Data demonstrated the products in front of participants gathered in their Chatsworth, California offices. By this time the new releases were almost ready to go and the development team working under Barbara began tweaking the final layouts and authoring the product documentation.

pdNickname 2.0 Pro and pdGender 2.0 Pro were released on Monday, December 30, 2013 and the Standard editions (without fuzzy logic) made their debut two weeks later.

pdNickname 2.0 Pro has 3.9 million records, including 2.61 million with fuzzy logic, and is 2.9 GB counting all formats and files. pdNickname 2.0 Standard has 1.28 million records, does not have fuzzy logic, and is 964 MB.

pdGender 2.0 Pro has 140,000 records, including 80,000 with fuzzy logic, and is 80.6 MB. pdGender 2.0 Standard has 60,000 records, does not have fuzzy logic, and is 25.5 MB.

Product information

Anatomy of a database, part 1

According to Peacock Data, plans for two just released product upgrades were initially written up in January 2009 and development began in earnest mid-summer of that same year. The products were built during the same development cycle because both are extracted from the same master file.

One of the new products is pdNickname 2.0, an advanced name and nickname file used by businesses and organizations to merge database records. They can match the data against their lists to determine if two or more records are the same individual. It identifies first names that are the same even when they are not an exact match, but rather equivalent, such as a variation or nickname.

The other new package is pdGender 2.0, a gender coding database built on the same set of names. Users can match the data against the first names on their lists to establish male and female identification.

Both upgrades embrace a host of similar and compatible features including languages of origin and use for each name as well as fuzzy logic so information can be recognized even when lists have typographical errors or uncommon spellings.

According to the company’s chief development coordinator Barbara Adair, “Creation of the master name file these new products are drawn from is the biggest venture our company has ever undertaken. There are thousands of sources for names in scores of languages, and our task was to compare and contrast all this data and create the ultimate first name resource.”

Information drawn from the sources includes variant spellings, relationships with other names, and the languages and gender associated with each name.

Barbara pointed out, “The language features have never been available before on this scale and required a sizable portion of the nearly five years of research and development.”

“From the start it was essential to identify the languages associated with each name in considerable detail,” she added. “This gives users previously unavailable ethnic demographics linked to the names already on their lists.”

Barbara showed some of the documents used in construction of the new offerings, including a manuscript from 731 AD, written by the monk Bede, listing the earliest English names dating from the Anglo-Saxon era of the Early Middle Ages. The still common personal name “Hilda” is an example from the manuscript.

“Because sources often give diverse information and use different spelling conventions, it was crucial not only to gather all the information possible but also to differentiate between the quality of sources,” Barbara explained. “Better information became easier to identify after working with the sources over the course of the first year.”

About half the database records are English and Spanish names, and international names originating and used in over 200 other languages make up the second half. This includes such languages as French, German, Chinese, Japanese, Vietnamese, Korean, Hindustani, Arabic, Persian, and Yiddish as well as Native American names and ancient Greek, Latin, and Hebrew names.

According to Barbara, “Special attention is paid to rare usages of unisex names like Kimberly, Hillary, Valentine, and even Maria. Names like these, while usually associated with one gender, are also occasionally employed by both genders. The new products identify rare usages so they can be considered separately. pdGender in particular employs this technology out-of-the-box allowing users to ignore rare unisex usages when assigning gender.”
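As an illustration of how such a rare-usage flag might be applied when gender coding a list, here is a minimal sketch; the column names NAME, GENDER, and RARE_UNISEX are hypothetical, not the actual pdGender layout:

    # Hypothetical sketch of gender coding a contact list while ignoring
    # rare unisex usages. Column names are illustrative only.
    import pandas as pd

    contacts = pd.DataFrame({"first_name": ["Maria", "Kimberly", "John"]})
    gender_table = pd.read_csv("pdgender.csv")   # e.g. NAME, GENDER, RARE_UNISEX

    # Keep only the common usage of each name when assigning gender.
    common_usage = gender_table[gender_table["RARE_UNISEX"] == 0]

    contacts["KEY"] = contacts["first_name"].str.upper()
    coded = contacts.merge(common_usage[["NAME", "GENDER"]],
                           left_on="KEY", right_on="NAME", how="left")
    print(coded[["first_name", "GENDER"]])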

“Beyond just identifying the languages of use, we also classify name origins, such as Old English as opposed to Middle English or modern English,” Barbara noted. “This adds value for those researching personal names or the relationships between languages, such as in the fields of anthroponymy, onomatology, ethnology, and linguistics.”

According to the product documentation, both packages identify five basic first name types:

  • Base Names
  • Variations
  • Short Form Nicknames
  • Diminutives
  • Opposite Gender Forms

“Assigning a type identification to each name was a lengthy part of development, but it is significant because the added information permits more precise filtering and ultimately better results,” Barbara said. “Base names are characteristically the oldest because they are the original names all later formations can be traced back to. A lot of time was devoted to these. It is important they are identified as accurately as possible because the remainder of the database is dependent on them.”

Most of the main product development was completed by the end of 2012 and field testing of beta editions commenced in January 2013.

See Anatomy of a database, part 2 for the rest of the story.

Product information

ACS vs. Summary File 3

The American Community Survey (ACS), which replaces the Census long form Summary File 3 (SF3), is a primarily mail-based household survey conducted by the U.S. Census Bureau with an annual sample size of about 3.5 million addresses and a response rate said to exceed 97 percent. Like SF3, it produces estimates for numerous social, economic, and housing characteristics. These estimates are summarized for geographic areas ranging from neighborhoods to Congressional districts to states to the entire nation. The smallest geographic entity presented is at the Census Block Group level.

The ACS shares many similarities with SF3, but there are also important differences. The chief advantage of ACS data is its far more frequent release. It collects responses continuously instead of every ten years. This gives planners at all levels of government, business, and the general public far more current data than the decennial long form, and provides for the first time information about temporary populations, such as beach and ski communities.

But this advantage is also a disadvantage. While the ACS is timelier, information is also smoothed (flattened) out and has a lower accuracy rate because it is conducted over a period of years instead of at a single point in time. This is particularly noticeable for small geographic areas, which must pool three or five years of data to accumulate a large enough sample for reliable estimates.

There are also differences in residence rules, boundaries and definitions of geographic areas, how and which questions are asked, and survey methodology.

Our pdACS2013 package is available for those wanting to try out the new ACS data.

A brief history of encryption

Encryption, part of the science of cryptography, is the process of transforming information using an algorithm (called a cipher) to make it unreadable to anyone except those possessing a special key to unlock the information. The unlocking process reverses the encryption and is known as decryption. According to a 2007 report by the Computer Security Institute, 71% of the companies surveyed utilized encryption for some of their data in transit, while 53% utilized encryption for some of their data in storage.

Encryption, however, is not an idea that was born in the computer age. It has an extensive past and has long been used by militaries, diplomats, and governments to facilitate secret communications. Here is a brief history:

Writing

Early encryption was solely concerned with converting messages into unreadable groups of figures to protect the content during the time the message was being carried from one place to another or otherwise in public view. The first approach to cryptography was the simple writing of a message. It worked because most people could not read.

As early as four thousand years ago, Egyptian scribes decorated the tombs of deceased rulers and kings with irregular hieroglyphs that some experts believe were intended to hide the meaning of the words. Others contend they were designed to make the text appear more regal and important.

The ancient Chinese utilized the ideographic nature of their language to hide the meaning of words, but unlike most cryptographic methods, it was used for personal correspondence rather than military intelligence. Genghis Khan, for example, is not believed to have employed encryption.

Scytale

The first known encryption device, invented by the Spartans around 400 BC, was a cylinder shaped baton or staff called a scytale. The coding technique consisted of wrapping a strip of leather or parchment around a scytale of a specific diameter. The sender wrote the message down the length of the staff, and then unwrapped the material, resulting in the order of the letters being scrambled. The recipient would then wrap the parchment back around a scytale of the same diameter, and the letters would be in the proper order again.

Caesar cipher

Next came the Caesar cipher which uses a simple system, today known as a substitution cipher, where each letter is shifted a specific number of positions up or down the alphabet. The recipient would decrypt the message by knowing in advance how many positions were needed to reverse the substitution or, probably more often, by trial and error until the correct shift was determined. It is named after Julius Caesar (100 BC–44 BC) because he used it to protect correspondence of military significance, but other substitution ciphers are known to have been used earlier. As a historical note, Caesar used a shift of three, and the second most common shift in ancient times was four.
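For readers who want to see the idea in code, here is a minimal Caesar cipher in Python, with Caesar’s own shift of three as the default:

    # Minimal Caesar cipher: shift each letter a fixed number of positions.
    def caesar(text, shift=3):
        result = []
        for ch in text:
            if ch.isalpha():
                base = ord("A") if ch.isupper() else ord("a")
                result.append(chr((ord(ch) - base + shift) % 26 + base))
            else:
                result.append(ch)   # leave spaces and punctuation untouched
        return "".join(result)

    encrypted = caesar("ATTACK AT DAWN", 3)   # -> "DWWDFN DW GDZQ"
    decrypted = caesar(encrypted, -3)         # reverse the shift to decrypt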

Another simple substitution cipher is the atbash for the Hebrew alphabet. It is mentioned in the Old Testament Book of Jeremiah and involves replacing the first letter of the alphabet with the last, the second letter with the second last, and continuing in that manner.
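Atbash is even simpler to express in code. The sketch below applies the same first-to-last mirroring to the Latin alphabet rather than the Hebrew one:

    # Atbash for the Latin alphabet: A<->Z, B<->Y, C<->X, and so on.
    def atbash(text):
        out = []
        for ch in text.upper():
            out.append(chr(ord("Z") - (ord(ch) - ord("A"))) if ch.isalpha() else ch)
        return "".join(out)

    print(atbash("Jeremiah"))   # -> "QVIVNRZS"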

More complex ciphers

Simple substitution methods of cryptography were the norm for about one thousand years, but during the Middle Ages they became ineffective as the practice grew better known and the population more literate. Aided by advances in mathematics, they were replaced by various random substitution methods as well as an approach known as columnar transposition, where messages are written in columns, and then the columns are rearranged.

Italian author and cryptographer Leon Battista Alberti (1404–1472), who has been called the “Father of Western Cryptography”, invented the polyalphabetic substitution method, which uses multiple substitution alphabets. A form known as the Vigenère cipher, which employs a series of different Caesar ciphers based on the letters of a keyword, withstood three centuries of attempts to break it. It was still being used by the Confederates during the American Civil War, but their messages were regularly cracked by the Union.

A German cryptographer named Johannes Trithemius (1462–1516) wrote Polygraphiae, published in 1518, the first printed book on cryptology. He also invented a steganographic cipher in which each letter was represented by a word taken from a sequence of columns, with the resulting series of words making up a prayer.

Cipher wheel

During his term as George Washington’s secretary of state (1790–1793), Thomas Jefferson devised a machine he called a “cipher wheel” consisting of twenty-six cylindrical wooden pieces threaded onto an iron spindle. The letters of the alphabet were inscribed on the edge of each wheel in a random order and turning the wheel scrambled and unscrambled the message. The recipient would decrypt the message with another cipher wheel and trial and error until the correct combination was determined.

Jefferson learned the importance of coded messages while he served as the U.S. minister to France (1784–1789) and found that European postmasters regularly opened and read all letters passing through their command.

ENIGMA

The most famous encryption device is ENIGMA, an electro-mechanical rotor cipher machine invented by German engineer Arthur Scherbius at the end of World War I and used by Nazi Germany during World War II. It employs multiple substitution ciphers that rotate iteratively, thereby minimizing the danger of frequency analysis discovering the key. Allied cryptographers, most famously the Polish Cipher Bureau and the British codebreakers at Bletchley Park, did however break the code during the war, which was a celebrated event for the Allies and affected military intelligence for the remainder of the war.

Modern times

Since the dawn of the computer era, more and more information has become accessible through the internet, both while at rest on servers and while in transmission from one network to another, advancing the needs and requirements of encryption by leaps and bounds. Modern versions now include such enhancements as message integrity checking, identity authentication, and digital signatures.

Today encryption is particularly important for protecting credit card numbers, Social Security numbers, passwords, personal identification numbers (PINs), client lists, business plans, and corporate intelligence. However, for many it is also important for concealing political dissent as well as opposition in the workplace.

Pretty Good Privacy

Today the best known encryption for business and private use is Pretty Good Privacy (PGP) developed by Philip R. “Phil” Zimmermann, Jr. Originally designed as a human rights tool, PGP was published for free on the internet in 1991. It is now owned by PGP Corporation (a part of Symantec).

There is no known method which will allow someone to break PGP encryption, and it has been described by computer security specialist Bruce Schneier as “the closest you’re likely to get to military-grade encryption.” Nonetheless, early versions of PGP were found to have theoretical vulnerabilities, so new editions have been released and development continues.

PGP and similar products follow the OpenPGP standard for encrypting and decrypting data. Defined by the OpenPGP Working Group of the Internet Engineering Task Force (IETF) in Proposed Standard RFC 4880, OpenPGP is today the most widely used email encryption standard in the world.

Advanced Encryption Standard

The most secure encryption at present is recognized as Advanced Encryption Standard (AES), developed by Belgian cryptographers Joan Daemen and Vincent Rijmen. It is based on the Rijndael cipher, which combines multiple rounds of substitution and permutation transformations. The cipher is described as a symmetric-key algorithm in which the same key is used for both encryption and decryption. AES has been adopted by the U.S. government and is now used worldwide.
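As an illustration of symmetric-key encryption with AES, the short sketch below uses the third-party Python cryptography package (our choice of library here is an assumption; any standards-compliant AES implementation behaves the same way in principle):

    # Illustrative AES-256-GCM example using the third-party "cryptography"
    # package (pip install cryptography). GCM mode also provides message
    # integrity checking.
    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    key = AESGCM.generate_key(bit_length=256)   # the same key encrypts and decrypts
    aesgcm = AESGCM(key)
    nonce = os.urandom(12)                      # never reuse a nonce with the same key

    ciphertext = aesgcm.encrypt(nonce, b"client list: confidential", None)
    plaintext = aesgcm.decrypt(nonce, ciphertext, None)
    assert plaintext == b"client list: confidential"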

Recent concerns

Security uncertainties involving encryption have been recent news items. The most public example occurred in May and June of 2013. Citing documents leaked by former National Security Agency (NSA) contract employee Edward Snowden, U.K. newspaper The Guardian reported that the NSA and its British counterpart, GCHQ, have cracked ciphers that are widely used to protect online information, including email and other communications, banking and medical records, and corporate data.

For years prior, some have expressed alarm that intelligence and defense agencies throughout the world, including the NSA and the U.S. Department of Defense, spend billions of dollars to crack encryption codes.

Security experts nevertheless maintain that, if properly implemented and maintained, encryption is still the best practical way to secure data.

The internet is a wonderful resource that allows us to communicate instantly with anyone, virtually anywhere—but it also allows access to cyber criminals and electronic hackers who can steal millions of dollars or wreak havoc in seconds.

It is safest to keep sensitive information on removable storage media such as portable external hard drives and flash memory drives. However, these can be lost, and this is not a useful security solution in many situations.

In a great majority of cases the impact of unwanted disclosure of information is negligible—but not always. Do not discount encryption as too difficult, time consuming, or expensive. It is none of these, and in some cases it is required by law. There is a good chance one day you will be glad you utilize it.

What are jam values?

In the world of statistical databases, jam values are common. They are hard-coded information displayed instead of derived measures. They are used to represent unique situations where either the information to be conveyed is an explanation for the absence of data, represented by a symbol in the data display, such as a dot “.”, or the information to be conveyed is an open-ended distribution, such as 1,000,000 or greater, represented by 1000001. Even an empty value or a zero (“0”) will often have a special meaning.

Jam values can also be used to explain why information cannot be disclosed, such as for privacy reasons or because the data does not meet certain filtering criteria.

Data definitions are provided with jam values and normally they are not difficult to utilize. Depending on the parameters of the project, some may even choose to ignore them altogether. However, it is important to understand that they exist, and are not treated like numbers in tabulations and analysis.

Special consideration must be taken when importing data with jam values because they are often in alpha form. Non-numeric values will be converted to zeros (0) if appended into fields that accept only numeric information.

Some users convert jam values to special numbers during the import process so numeric fields can be used. Numeric fields are easier to work with because they do not have to be converted when counted or used in calculations.
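A minimal sketch of that conversion step is shown below. The jam codes listed are generic examples, not a complete list for any particular survey:

    # Convert textual jam values to None (or a sentinel number) on import
    # so the remaining values can be stored in a numeric field.
    JAM_VALUES = {".", "-", "N", "(X)", "**", "***"}

    def to_numeric(raw):
        value = raw.strip()
        if value == "" or value in JAM_VALUES:
            return None        # or a special sentinel such as -999999
        return float(value)

    row = [".", "52500", "(X)", "48123"]
    print([to_numeric(v) for v in row])   # [None, 52500.0, None, 48123.0]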

You will find jam values frequently employed in American Community Survey (ACS) estimates and margins of error as well as in our own ACS demographics product, pdACS2013.

Using the pdNickname RELFLAG field

pdNickname is a unique, nearly 50,000-record database designed to facilitate comparing sets of first name data based on nicknames, diminutives, pet names, variations and given names. One of the most important fields in the database product is RELFLAG, which stands for “Relationship Flag”.

The RELFLAG field contains one of two possible values:

1 = Close relationship between the name and variation (common variants): Includes closely associated nicknames, diminutives and pet names as well as first name variations that are considered closely related.

2 = More distant relationship between the name and variation (less common variants): Includes alternate forms of the names, often deriving from another culture, as well as nicknames, diminutives and pet names that are relatively uncommon.

pdNickname variations for the given name “SAMUAL”. The RELFLAG field indicates if the name and variation have a (1) close or (2) more distant relationship.

The RELFLAG field is useful for controlling what is to be considered an acceptable match. As more distant relationships are included in matches, the error rate naturally rises. The error rate increase is usually not substantial, but it is measurable in hundredths and tenths of a percent.

RECOMMENDATIONS

RESIDENTIAL: While additional accuracy can be achieved if only close relationships are considered, with residential lists the increase in the error rate is almost always very small even when the more distant relationships are included—rarely more than 0.02% in our testing. Therefore, under best practices, it is fully acceptable to use all RELFLAG relationships when matching residential lists. With the exception of the George Foreman family, most errors that might occur result from different given names that share the same nickname or other variation.

BUSINESS AND ORGANIZATION LISTS: On the other hand, with business and organization lists, when the more distant relationships are included the increase in the error rate is typically higher than with residential lists. Our testing normally shows an increase that is still less than 0.1%, but we have seen it as high as 0.3% with some large lists. Under best practices, it is recommended that only close relationships be considered when processing business and organization lists.
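A short sketch of how these recommendations might be applied when building a lookup table is shown below; the file and column names are illustrative, not the actual product layout:

    # Hypothetical sketch: restrict the nickname lookup by RELFLAG depending
    # on the type of list being matched.
    import pandas as pd

    nicknames = pd.read_csv("pdnickname.csv")   # e.g. columns: NAME, VARIATION, RELFLAG

    def variant_lookup(list_type):
        # Residential lists: all relationships are acceptable.
        # Business and organization lists: close relationships (RELFLAG = 1) only.
        if list_type == "business":
            return nicknames[nicknames["RELFLAG"] == 1]
        return nicknames

    residential_lookup = variant_lookup("residential")
    business_lookup = variant_lookup("business")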

How to zip and unzip files

The following instructions show how to compress and uncompress a file under the Windows, Mac, and Linux operating environments. Note that under most systems you can select multiple files as well as folders/directories to zip into the same archive.

Zipping a file creates a compressed version of the file that is usually considerably smaller than the original file. The zipped version of the file has a .zip file extension.

Unzipping a file reverses the zip process and extracts the file from the compressed archive.

IMPORTANT: Please be cautious about opening .zip files from unknown e-mail senders because they can contain viruses. Confirm with known senders before opening a .zip file.

WINDOWS ME, XP, VISTA, 7, 8 and SERVER 2003 (or higher)

Compress files (zip files)

  1. Locate the file you want to compress.
  2. Right-click the file; then point to Send to and click Compressed (zipped) folder.

    A new compressed zip file is created in the same location.

Uncompress files (unzip files)

  1. Locate the compressed zip file you want to extract.
  2. Do one of the following:
    • To extract a single file or folder, double-click the zip file to open it; then drag the file or folder from the archive to a new location.
    • To extract the entire contents of the zip file, right-click the zip file; then click Extract All; then follow the instructions on the screen.

Windows 95, 98, 98SE and 2000 do not have built-in zip file support, and it is necessary to utilize third-party software to create and extract zip files.

MAC OS X (10–10.4)

Compress files (zip files)

  1. Use Finder to locate the file you want to compress.
  2. Control-click or right-click the file icon; then click Create Archive of [. . .]

    A new compressed zip file is created in the same location.

Uncompress files (unzip files)

  1. Use Finder to locate the compressed zip file you want to extract.
  2. Double-click the file icon.

    The files contained in the archive will be extracted to the same location.

MAC OS X (10.5–LION)

Compress files (zip files)

  1. Use Finder to locate the file you want to compress.
  2. Control-click or right-click the file icon; then click Compress [. . .]

    A new compressed zip file is created in the same location.

Uncompress files (unzip files)

  1. Use Finder to locate the compressed zip file you want to extract.
  2. Double-click the file icon.

    The files contained in the archive will be extracted to the same location.

LINUX

Compress files (zip files)

  1. Open a shell prompt.
  2. Enter the following: zip -r filename.zip filedir

    A new compressed zip file is created in the selected location.

Uncompress files (unzip files)

  1. Open a shell prompt.
  2. Enter the following: unzip filename.zip

    The files contained in the archive will be extracted to the same location as the zip file.
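If you prefer to script the process, the sketch below uses Python’s standard zipfile module, which works the same way on Windows, Mac, and Linux (the file names shown are placeholders):

    # Cross-platform zip and unzip using Python's standard library.
    import zipfile

    # Compress: add one or more files to a new archive.
    with zipfile.ZipFile("filename.zip", "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write("report.csv")     # repeat for additional files

    # Uncompress: extract the entire contents to a folder.
    with zipfile.ZipFile("filename.zip") as archive:
        archive.extractall("extracted")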

CASS vs. DSF2

With respect to database management, one of the questions we are most often asked concerns the difference between CASS and DSF2 address/mail processing. These services have grown more similar in recent years, but there are still important differences.

CASS

CASS – Coding Accuracy Support System – standardizes addresses and improves the accuracy of carrier route, five-digit ZIP, ZIP+4 and delivery point codes that appear on mail pieces. When mailers CASS certify their lists, they can become eligible for bulk mail discounts from the U.S. Postal Service (USPS).

In the past, applying these codes only required that mailing address fall within a valid address range. The address itself was not necessarily accurate. For example, ZIP+4 codes typically represent a range of 100 or fewer house numbers on a given street. While 100-199 Rollins Road may be assigned to ZIP+4 code 91311-7007, not all those house numbers are actual delivery points.

Since November 2, 2006, the USPS has determined that CASS-related discounts will only apply when the agency’s Delivery Point Validation (DPV) process confirms the primary number—or the first line—of the addresses. If the primary street number of an address cannot be confirmed with DPV, then no ZIP+4 code is assigned to the mail. Without a ZIP+4 code, the mail ceases to qualify for postal discounts. DPV is therefore now included with CASS processing.

CASS also integrates Locatable Address Conversion System service (LACSLink) which converts some PO Box and rural route addresses to city-style addresses and updates addresses that have been renumbered or renamed.

CASS processing software is graded by the USPS National Customer Support Center (NCSC), located in Memphis, Tennessee, by processing a test file. To be certified a 98.5% accuracy rate is required.

DSF2

Prior to the November 2006 requirements added to CASS processing, DSF2 – Delivery Sequence File Second Generation – was the only fail-safe method of checking the accuracy of mailing addresses since CASS only determined if the address fell within a valid address range. But DSF2 continues to offer advantages that can make it a good choice.

In addition to all the processing and validation supplied with CASS, DSF2 also provides mailers with other important address attributes. It classifies an address as residential or business, identifies the delivery type—curb-side, door-slot, Neighborhood Delivery and Collection Box Unit (NDCBU) or central delivery—and flags seasonal and vacant addresses.

This information allows for more targeted mailings. For example, a company may want to send out a sample box of laundry detergent—with DSF2 they can target their mailing to residential addresses that do not have a door-slot mailbox since the sample would not fit through the slot.

DSF2 also appends codes that can be used to qualify for USPS Enhanced Carrier Route (ECR) Basic, High Density and Saturation postal discounts. These include Pseudo Delivery Sequence Numbers for walk sequence postal presort discounts and Enhanced Line of Travel codes (eLOT).

CASS and DSF2 have similarities and important differences. The choice is determined by the targeting required and the postal discounts that can be qualified for. In both cases, however, users will typically benefit from a significant reduction in undeliverable mail, speedier delivery and lower costs.

Regularly review database systems

Maintaining a successful database infrastructure requires regular review to establish what is going right and where problems may be lurking. This should consist of close consideration of all aspects of the database system, including: hardware & software, administration & input, documentation, staff & training and disaster recovery.

The frequency of database system reviews is dependent on the size of a company or organization and the complexity of the system. At a minimum, we recommend quarterly reviews, but many situations warrant more frequent action.

Prior to conducting any database system review it is important to establish a set of benchmarks and checklists to compare against the findings.

HARDWARE & SOFTWARE: A database system review should begin with an assessment of the applications, computers, workstations, network servers and other devices that underlie and run the system. Decide if they are meeting the expectations and requirements of end users and if they are doing so efficiently. Needs change and technologies grow rapidly, and keeping on top of the machinery is essential in a competitive environment.

ADMINISTRATION & INPUT: Take a long look at how the database system is administered and the input into the tables. Determine if there has been unauthorized augmentation of a database, particularly rogue changes, or if unapproved or non-standard practices and tools are utilized. This will help maintain the system’s physical and logical models as well as prevent costly downtime and gaps in performance.

DOCUMENTATION: The fun often stops for tech people when they have to hang up their programming tools and put their work down on paper. Consequently, documentation is often lacking for database systems. Make sure this is not the case because end use of the system is highly affected. Documentation should include a data dictionary and reflect the current physical and logical state of the infrastructure as well as be understandable to the less tech savvy.

STAFF & TRAINING: The people part of the review is very important because a database system is only as good as those that run and use it. It is important to align duties properly, and the staff needs to have the necessary expertise and training to adequately leverage the technology and be equipped to handle new complexities in the infrastructure. Investment in this area can reap large rewards.

DISASTER RECOVERY: Last but definitely not least, assess the database system in terms of its ability to recover from a disaster. Backups need to be performed regularly and properly stored, and it is vital this includes offsite backup. Additionally, make sure there is an adequate plan for unforeseen complications and worst-case scenarios and that the system’s immunization against viruses, worms and other web-based attacks is at full strength. This is particularly important when there are substantial changes to the database infrastructure.

Database management can become overwhelming as requirements escalate and the volume of data mushrooms. Regular review of a database system is essential to preserve the return on investment, meet objectives and ensure long-term success.

Be kind—don’t pollute telephone fields

One of the most exasperating things about processing telephone data is all the junk that often gets added next to the numbers. Little notes like “cell”, “parent’s phone”, “call before 8 p.m.” and “disconnected” can wreak havoc when the information is processed with telephone update services or sent through merge/purge, as well as when utilized in-house.

Having the extra information included is particularly destructive when trying to verify telephone numbers or running a reverse append. Often these numbers are flagged as invalid and not properly processed.

Merge/purge can also be harmfully affected if telephone information is used in the match criteria and matches are missed due to the “database pollution”.

The little notes can be a problem in-house as well. They show up when printing telephone lists and when supplying data to your call center. And they can also cause issues when performing queries on the tables.

If notes about telephone numbers are necessary, a separate field should be provided and utilized by the end user. And if the problem persists, database managers can limit the size of the telephone fields so there is not enough room for the notes.
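For database managers cleaning up existing data, a minimal sketch of splitting a polluted telephone field into a number and a note is shown below; the parsing rules are illustrative only:

    # Illustrative cleanup of a polluted telephone field: keep only the digits
    # of the number itself and move anything else to a separate note field.
    import re

    def split_phone(raw):
        match = re.match(r"[\d\s()\-.+]+", raw)    # leading phone-like characters
        number = re.sub(r"\D", "", match.group()) if match else ""
        note = raw[match.end():].strip(" ,;") if match else raw.strip()
        return number, note

    print(split_phone("(818) 555-0123 cell, call before 8 p.m."))
    # -> ('8185550123', 'cell, call before 8 p.m.')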

Most importantly, lessons in database etiquette, including a list of do’s and don’ts, should be included in training for anyone who accesses the database tables.

A company or organization’s data resources are among its most important assets. Following a few simple rules when accessing them can greatly help maintain their value and extend their effectiveness.

Be kind—don’t pollute!