The CLS Approach to E-Discovery Processing


A Capital Legal Solutions White Paper
Written Summer, 2009


Processing electronically stored information (ESI) for discovery is a multifaceted challenge. It is both art and science. There is no universally agreed-on path from the beginning of the process to the end. Every company has its own approach to processing, and every project requires a new variation on that approach.

At Capital Legal Solutions (CLS) we take a modular approach. The proprietary processing system we have developed consists of more than forty modules, each designed and coded to perform a specific function and to link seamlessly with other modules. We can rearrange the sequence of the modules. We can include some and exclude others. We can easily customize our workflow to accommodate the unique requirements of each client and each case.

The modular approach yields direct benefits to our clients. Most importantly, it makes us flexible and able to mount an agile response to changing circumstances. If the case team suddenly needs to shift direction, CLS can accommodate that need in minimal time with minimal angst.

It is useful for discussion purposes to divide our processing into ten steps. The steps are coordinated and interlock to ensure the highest-quality data ingestion and output. As an added benefit and safety measure, we conduct an automated audit of each of our processes and provide our clients with detailed reports that document our processing.

The ten steps are:

1. Data Ingestion: we keep track of all incoming data

2. System Filtering: we remove all system files

3. Data Decomposition: we decompose composite files into their individual component files

4. Metadata Retrieval: we retrieve all available metadata

5. Data Culling: we apply effective, appropriate culling techniques

6. Data Processing: we apply the complete array of ESI processing techniques

7. Data QC and Reporting: we apply the most rigorous quality control protocols

8. Data Delivery: we prepare data for delivery according to client specifications

9. Production: we produce documents accurately in the specified form of production

10. Data Archiving: we archive all data

Let’s take a look at each step in more detail.

1. Data Ingestion

Inventory is the first step of our e-discovery processing. We have devoted considerable time and resources to our chain of custody process to ensure that we properly track and identify each document we receive. We perform this step only after the project manager has released the data to the CLS engineering staff and all case requirements have been vetted.

During the inventory process our engineers catalog each object received from the client. They perform initial system metadata extraction and store the results in a secure back-end database.

Each file becomes traceable in our environment through the assignment of a unique CLS identifier that facilitates auditing, organization of the data on our servers, and reporting. Project managers receive initial reports so they can keep clients informed regarding data received.

After completion of the initial inventory we run File Header Identification. During this process we examine the header information of each file to identify its true file type and extension, since in some cases the original file extension has been altered or is otherwise misleading. As the files run through this process we compare the derived extension against the original extension and determine the actual document format. Throughout this process we apply internal checks and business rules to yield accurate results and seamless data flow.
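As an illustration only, the comparison of a derived file type against the original extension can be sketched in a few lines of Python. This is not the CLS implementation: the signature table, the compatibility map, and the function names (identify_by_header, compare_extension) are assumptions chosen for the example, and a production table would cover far more formats.

```python
from pathlib import PurePath
from typing import Optional

# Illustrative signature table (an assumption, not the CLS rule set).
# Each entry maps the leading bytes of a file to a derived type.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),                          # also docx/xlsx/pptx
    (b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", "ole2"),   # legacy Office, MSG
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xFF\xD8\xFF", "jpg"),
    (b"II*\x00", "tif"),
    (b"MM\x00*", "tif"),
]

# Extensions that legitimately share each derived type.
COMPATIBLE = {
    "pdf": {"pdf"},
    "zip": {"zip", "docx", "xlsx", "pptx"},
    "ole2": {"doc", "xls", "ppt", "msg"},
    "png": {"png"},
    "jpg": {"jpg", "jpeg"},
    "tif": {"tif", "tiff"},
}


def identify_by_header(header: bytes) -> Optional[str]:
    """Return the derived type for the given leading bytes, or None."""
    for magic, kind in SIGNATURES:
        if header.startswith(magic):
            return kind
    return None


def compare_extension(file_name: str, header: bytes) -> dict:
    """Compare the original extension with the type derived from the header."""
    original = PurePath(file_name).suffix.lower().lstrip(".")
    derived = identify_by_header(header)
    if derived is None:
        status = "unknown"       # header not in the table: route to review
    elif original in COMPATIBLE[derived]:
        status = "match"
    else:
        status = "mismatch"      # extension altered or misleading
    return {"original": original, "derived": derived, "status": status}
```

For example, a file named report.txt whose first bytes are %PDF- would be reported as a mismatch, while memo.docx beginning with the ZIP signature would be reported as a match.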

2. System Filtering

Our standard process is to remove system files from the data set before further processing. System files – for example, the thousands of files loaded onto a computer when Windows is installed – are usually irrelevant to discovery. Removing them at an early stage saves the time and money that might otherwise be wasted on processing and reviewing them.

We use two methods to remove system files:

Directory Exclude List identifies known system folders across various operating environments. With client consent, we exclude the contents of the folders on this list from subsequent processing.

File Extension Include List specifies the types of files that we will include in subsequent processing. Again, with client consent, we apply the list to identify files that will be included. We currently support and process more than 190 different file types. Some are common, such as the files of the Microsoft Office suite. Others are specialized, such as AutoCAD drawings, audio and video files, and Becky (Japanese mail) files. We can engineer processes for proprietary file types upon client request.

CLS project managers provide detailed reports to the client on all files that are excluded because of system filtering. We also provide the standard Directory Exclude List and File Extension Include List upon request.
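The two filtering methods described above can be sketched as follows. This is a minimal illustration under stated assumptions: the folder names and extensions in the two lists are hypothetical samples, not the CLS standard lists, and the function names are invented for the example. The sketch records a reason for every excluded file so that an exclusion report can be produced.

```python
from pathlib import PureWindowsPath
from typing import List, Optional, Tuple

# Hypothetical sample lists; the real lists are available from CLS on request.
DIRECTORY_EXCLUDE = {"windows", "program files", "$recycle.bin"}
EXTENSION_INCLUDE = {".doc", ".xls", ".ppt", ".pdf", ".pst", ".msg", ".dwg"}


def exclusion_reason(path: str) -> Optional[str]:
    """Return why a file is filtered out, or None if it should be kept."""
    p = PureWindowsPath(path)
    # Compare every folder in the path (not the file name) to the exclude list.
    folders = {part.lower() for part in p.parts[:-1]}
    hits = folders & DIRECTORY_EXCLUDE
    if hits:
        return "directory on exclude list: " + sorted(hits)[0]
    suffix = p.suffix.lower()
    if suffix not in EXTENSION_INCLUDE:
        return "extension not on include list: " + (suffix or "(none)")
    return None


def apply_system_filter(paths: List[str]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split paths into kept files and (path, reason) pairs for the report."""
    kept, excluded = [], []
    for path in paths:
        reason = exclusion_reason(path)
        if reason is None:
            kept.append(path)
        else:
            excluded.append((path, reason))
    return kept, excluded
```

The directory check runs first in this sketch, so a file in an excluded folder is reported under the folder rule even if its extension is also off the include list.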

3. Data Decomposition

Composite files (also called container files) are files that contain other files, often in compressed format. Some composite files contain email messages and attachments. A PST file, for example, contains Microsoft Outlook email. Other composite files, such as ZIP files, can contain any type of electronic document. A composite file may contain any number of other files; maybe one, maybe hundreds or thousands. And a composite file may contain other composite files.

Our process iterates recursively through each composite file and extracts each item within. In our database we track the relationships between the container and its content so we can always tell whether a file was extracted from a composite file and, if so, which composite file it came from.
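The recursive extraction and relationship tracking described above can be illustrated with a short Python sketch. It handles ZIP containers only, using the standard zipfile module, and keeps its records in a list rather than a database; PST and other container formats would need their own extractors. The record layout and the function name decompose are assumptions made for the example.

```python
import io
import zipfile
from itertools import count


def decompose(name, data, parent_id=None, records=None, ids=None):
    """Recursively expand ZIP containers, recording parent/child links.

    Returns a list of records; each record carries the id of the container
    it was extracted from (parent_id is None for top-level files).
    """
    if records is None:
        records = []
    if ids is None:
        ids = count(1)
    item_id = next(ids)
    # Note: ZIP-based formats such as docx also test as containers here.
    is_container = zipfile.is_zipfile(io.BytesIO(data))
    records.append({
        "id": item_id,
        "parent_id": parent_id,
        "name": name,
        "is_container": is_container,
    })
    if is_container:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for info in zf.infolist():
                if info.is_dir():
                    continue
                # A child may itself be a container, so recurse.
                decompose(info.filename, zf.read(info), item_id, records, ids)
    return records
```

Because every record stores its parent's identifier, the full chain back to the original container can be rebuilt for any extracted file by following parent_id values upward.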

4. Metadata Retrieval

Thorough extraction of metadata is one of the most crucial components of e-discovery processing, so we capture all available metadata from every object in the data set. This includes both application metadata (metadata created by the native application and stored within the file) and file system metadata (metadata created by the operating system).

During this stage we calculate hash values for all objects and store those values in our database. Hash values are “fingerprints” that identify files with identical content and so facilitate deduplication. Our standard approach is to calculate and store both the MD5 and SHA-1 hashes. We also populate a field in our database known as the CLS Filter Date, which we discuss in more detail below. QC at this stage is very important to ensure that all metadata has been extracted properly.
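For readers unfamiliar with hashing, the following sketch shows one way to compute both MD5 and SHA-1 values for a file in a single pass, using Python's standard hashlib module. It is an illustration, not the CLS code; the function name fingerprint and the chunk size are assumptions.

```python
import hashlib
import io


def fingerprint(stream, chunk_size=1 << 20):
    """Compute MD5 and SHA-1 of a binary stream in one pass.

    Reading in chunks keeps memory use flat for very large files.
    """
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        md5.update(chunk)
        sha1.update(chunk)
    return {"md5": md5.hexdigest(), "sha1": sha1.hexdigest()}


# Two byte-for-byte identical files yield identical fingerprints,
# whatever their names or locations.
a = fingerprint(io.BytesIO(b"quarterly report"))
b = fingerprint(io.BytesIO(b"quarterly report"))
same_content = (a == b)
```

In practice the stream would come from open(path, "rb"). Storing two independent hashes makes an accidental match between different files even less likely than with either hash alone.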

It is important to note that CLS extracts metadata from objects in their native environment. For example, when we process Lotus Notes files, we do so in the Notes environment. Some vendors convert Notes email to Outlook format and then extract the metadata from the converted Outlook file. This approach risks spoliation and produces output that does not look like Lotus Notes mail. The CLS process results in accurate metadata and a deliverable that looks precisely the way Lotus Notes mail looks.

5. Data Culling

Culling reduces the amount of information that will be passed downstream for attorney review. This is very important because review is the most expensive component of an e-discovery project. Culling irrelevant data is an efficient way to reduce review time and save money.

CLS offers several standard culling methods and we can work with you to devise customized culling processes if your matter demands such treatment.

In addition to System Filtering, which we discussed earlier, our standard culling methods include:

Date Filtering to include or exclude documents from selected date ranges. The CLS Filter Date mentioned above is the field we use for date filtering. The date in this field is identical for all family members, which means that e-mails and their corresponding attachments receive the same filter date. The filter date for e-mails and attachments is the Sent Date[1]. The filter date value for loose documents is the Modification Date[2].

Folder Filtering to include or exclude specific e-mail folders or operating system folders.

File Extension Filtering to include or exclude documents with specific file extensions.

Deduplication either within or across custodians. CLS deduplicates e-mail at the parent level so as not to destroy parent-child relationships. The hash value for loose documents is based on the content of the file. The hash value for e-mail is calculated from a string that combines several e-mail field values. CLS will provide documentation of our e-mail hashing process upon request.

Searching to include only documents that meet search criteria that you specify. Keywords are the most common search criteria. CLS reports the search results in a detailed “hit report.” We can also provide custom search reports upon request.
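To make the date filtering and parent-level deduplication rules above concrete, here is a minimal Python sketch. The filter date fallback order follows notes [1] and [2]. Everything else is an assumption made for illustration: the dictionary layout, the function names, the decision to drop families with no usable date, and above all the four fields combined in email_hash, since the fields CLS actually combines are documented separately and are not listed here.

```python
import hashlib
from datetime import date
from typing import List, Optional


def filter_date(parent: dict) -> Optional[date]:
    """First usable date: Sent, Modification, Creation for e-mail;
    Modification, Creation for loose documents."""
    if parent["kind"] == "email":
        order = ("sent", "modified", "created")
    else:
        order = ("modified", "created")
    return next((parent[f] for f in order if parent.get(f) is not None), None)


def email_hash(msg: dict) -> str:
    # ASSUMPTION: these four fields are illustrative only.
    parts = [msg.get("sender", ""), msg.get("subject", ""),
             str(msg.get("sent", "")), msg.get("body", "")]
    return hashlib.md5("\x1f".join(parts).encode("utf-8")).hexdigest()


def cull(families: List[dict], start: date, end: date) -> List[dict]:
    """Keep families inside the date range, dropping duplicate parents.

    Each family is {"parent": {...}, "children": [...]}; a loose document
    is a family with no children.
    """
    seen = set()
    kept = []
    for fam in families:
        parent = fam["parent"]
        fd = filter_date(parent)
        # Every family member receives the parent's filter date.
        for member in [parent, *fam["children"]]:
            member["filter_date"] = fd
        # Assumption: a family with no usable date is not kept.
        if fd is None or not (start <= fd <= end):
            continue
        if parent["kind"] == "email":
            key = email_hash(parent)
        else:
            key = parent["content_hash"]  # hash of the file content
        if key in seen:
            continue  # the whole family goes, so no attachment is orphaned
        seen.add(key)
        kept.append(fam)
    return kept
```

Because the comparison is made on the parent only, an attachment is never removed while its e-mail is kept, and the family stays intact in either outcome.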

6. Data Processing

CLS performs a variety of case-critical processes during this stage to generate rich and precise output. Among these tasks are:

Text Extraction using in-house tools to capture all the available text in a file. This enables us to support the most accurate full text indexing and searching, crucial to the legal team’s research.

OCR using a highly evolved Optical Character Recognition (OCR) process on the image files in the document population. We also apply OCR to files from which text is not available by converting those files to image format and then running OCR on the resulting images. We can OCR documents in any language; accuracy depends on the quality of the input.

Printing[3] (TIFFing) to convert files to Group IV TIFF, JPEG or PDF formats. We also have printing/TIFFing options to address issues unique to certain file types. For example, when we print Excel spreadsheets we can remove hidden rows and columns and blank pages and turn off auto-filtering. We can print Word documents with track changes turned on or off. PowerPoint Presentations can be printed with speaker notes or hidden slides. Many other options are available and we can customize still more options as specific cases demand.

Language Identification to identify the primary language (and potentially the secondary language) of a document based on the content of its extracted text.

Machine Language Translation to convert a document from one language to another using the industry’s best machine translation software. For example, we can convert from Japanese to English or Romanian to English.

Full Text Searching across the document population. CLS collaborates with clients to come up with the most useful search criteria. Our exception reporting for any file flagged “Text not available” validates the data universe being searched. For early case assessment, we can execute searches across a sample data set to generate clear and informative reports for client review.

Image Conversion to change one image format to another while maintaining original image specifications. We can perform other image manipulation options upon request.

Bates Stamping to permanently burn into images any information needed in one of six document positions per client request and instruction.

Smart Data Analysis to identify whether a document has certain properties that can be crucial to litigation. CLS will deliver standard fields of the client’s choosing that indicate whether a document has any of the following characteristics:

· Is password protected

· Has hidden rows or columns

· Has track changes

· Has print revisions

· Has show revisions

· Has protected worksheet

· Is e-mail body encrypted (on Lotus Notes e-mails)

· Has speaker notes

· Has hidden slide

Document Summary and Key Phrase Extraction utilizing an algorithm that extracts important terms and sentences from a document based on relevance and weight. Our end-users have used this output to enhance and greatly accelerate the review process.

Document Clustering to group documents with similar content using keywords and phrases contained in the documents. This allows the end-user to go through clusters of similar documents quickly, slashing attorney review time – the most expensive aspect of the e-discovery process.

Near Deduplication to group similar, but not identical, documents based on the text of the documents. The user can specify a threshold, i.e., a percentage of similarity. The higher the percentage, the more similar the returned documents will be.

E-mail Threading to reconstruct and reflect the e-mail conversation patterns between individuals, using the relevant metadata fields.

Password Cracking to open password-protected files using several methods, including passwords supplied by the client and a “brute force” option.

XML Parsing using in-house experts who can parse the content of XML files to obtain the output the client requires.

Embedded Object Extraction to extract and render embedded objects from many file formats, ensuring that all needed information is collected from the file.

Native Document Production using standards developed by CLS in consultation with clients. A hash-based log is always provided with native production sets.

Data Transformation using our internally developed utility that supports output to any format and to the various litigation database formats in the industry.

Data Analysis/Special Projects supported by the CLS engineers, programmers, testers and database administrators on staff. We are well versed in requirements gathering and analysis, testing and delivery, and our detail-oriented staff will see each project through to the end.
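The similarity threshold used in Near Deduplication, described earlier in this list, can be illustrated with a simple and widely used measure: Jaccard similarity over word shingles. This is a stand-in technique chosen for clarity, not a description of the CLS algorithm, and the function names and the shingle length of three words are assumptions.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[tuple]:
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def near_duplicates(docs: Dict[str, str], threshold: float) -> List[Tuple[str, str, float]]:
    """Return every pair of documents whose similarity meets the threshold.

    Pairwise comparison is quadratic; it is fine for a sketch but large
    populations need an indexing scheme instead.
    """
    pairs = []
    for x, y in combinations(sorted(docs), 2):
        score = similarity(docs[x], docs[y])
        if score >= threshold:
            pairs.append((x, y, score))
    return pairs
```

With this measure, two nine-word sentences that differ only in the last word score 0.75: they share six of eight distinct shingles. A threshold of 70% would group them, and a threshold of 80% would not.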

7. Data QC and Reporting

Before we deliver our work product to you, our quality assurance specialists verify the delivery volume. Process engineers then execute their QC process, either reviewing all files or sampling the output using a well-established randomization approach used by the Department of Defense. During this process we inspect for blank pages, garbled text and improper files, and we take appropriate action to rectify any anomalies, reporting to the client as needed.
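As a rough illustration of sample-based QC, the sketch below draws a reproducible simple random sample of documents and flags pages with no text. It does not implement the Department of Defense sampling scheme referred to above, which is not specified in this paper; the seed handling, function names, and blank-page test are assumptions made for the example.

```python
import random
from typing import Iterable, List


def draw_qc_sample(doc_ids: Iterable[str], sample_size: int, seed: int) -> List[str]:
    """Draw a simple random sample of documents for manual inspection.

    A recorded seed lets the same sample be redrawn later for audit.
    """
    ids = sorted(doc_ids)
    if sample_size >= len(ids):
        return ids  # small deliveries are reviewed in full
    rng = random.Random(seed)
    return sorted(rng.sample(ids, sample_size))


def blank_pages(page_texts: List[str]) -> List[int]:
    """Return 1-based numbers of pages whose extracted text is empty.

    An empty page may be truly blank or image-only, so this only
    nominates pages for a closer look.
    """
    return [n for n, text in enumerate(page_texts, start=1) if not text.strip()]
```

Sorting the identifiers before sampling means the result depends only on the document set and the seed, not on the order in which the files happened to be listed.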

CLS has more than 40 different processing reports available to clients, including:

Inventory

Data Culling

Deduplication

Searching

Data Delivery

Exceptions

Data Processing Summary

Our exception reporting is very detailed and precise, including a description of why each item qualified as an exception. Our standard exception messages include:

File Format Not Supported
Password Protected File
Corrupt File
Text Not Available
Empty File
File Unable to Render to Image

CLS project managers provide a third and final level of QC before deliverables go to the client.

8. Data Delivery

We confer with clients on their delivery database formats and folder requirements. We output to all application formats, and we work diligently to provide the data in a format that meets the client’s needs. For new projects we often send sample deliverables to ensure that all requirements are met prior to full data delivery. Once a client’s specifications are confirmed, they are stored in a global database for use in future projects. We use the established specifications as the client’s standards until confirmation of change is received.

We determine the appropriate delivery media based on the size of the deliverable. Options for delivery include hard drive, DVD, CD, or secured FTP server. Throughout the process, project managers interact continuously and proactively with clients regarding turnaround time, delivery media and any other issues.

9. Production

CLS produces documents in the form prescribed by the client – native, image or traditional hard copy. We do so whether the client reviewed the documents in our eZReview platform or in some other environment. We support multiple image formats, the most common being TIFF, JPEG and PDF (including searchable PDF). We produce in single page or multipage format. We have produced in specially designated formats for courts and for government agencies.

Our typical turnaround time for production requests is 24 to 48 hours from final confirmation, depending on the nature of the project. CLS can also load productions online in eZReview for opposing counsel to review, or distribute them on CD or DVD using eZRLite (a compact version of eZReview).

10. Data Archiving

After processing, CLS maintains client data on our live servers for approximately three (3) months, after which we back up the retired data to our offline servers. After one year, following the receipt of written authorization from the client, CLS erases all data and provides the client with a Certificate of Destruction.

Summing Up

CLS has used its modular approach to e-discovery processing to great effect on incredibly demanding projects. Our processing is accurate, reliable, fast and flexible. Above all, it is battle-tested and proven. And here are a few final facts:

· Our processes are supported by a staff that is highly trained and highly motivated. The CLS engineering staff is far deeper than that of any other e-discovery company.

· We employ the latest technologies. Our infrastructure is state-of-the-art. Our algorithms are mathematically sound.

· We cultivate all of our processes in-house. This allows us full control over our workflow, including the ability to create an “alternate universe” outside our standard workflow to support special client needs.

· We are an international firm that is 100% Unicode-compliant and language-independent. We can manage any data set accurately and effectively, regardless of language.



[1] If the Sent Date cannot be used, the filter date is populated from the Modification Date. If the Modification Date cannot be used, the Creation Date is used.

[2] If the Modification Date cannot be used, the Creation Date is used.

[3] The method used to convert a file to image format (usually TIFF format) is to print it. Instead of sending the output to paper, however, it is captured as a picture and saved as a file on the hard drive. In e-discovery when you see the word “printing,” think “TIFFing.”