TranscriptWizard™ and TranscriptWizard™ Plus
TranscriptWizard is perhaps the most difficult 'simple program' I have ever written. Its primary job is to convert an ASCII deposition transcript into a PDF. A simple conversion from one text format to another.
The program was written to solve a common problem among litigators: often an attorney will receive transcripts of expert witnesses from prior cases. These transcripts arrive in ASCII format, but plain ASCII isn't convenient to read, either on screen or on paper. So it generally needs reformatting, and while we're at it, let's put it into a standard format. Also, litigators like 'condensing' things.
The needs of this program started out very simply:
- read the ASCII text
- determine lines and pages
- output to PDF:
  - one page per sheet
  - two pages per sheet
  - four pages per sheet (condensed)
Now for the additional REAL requirements:
- line/page numbering must be respected as these documents are cited
- it's all worthless without an indexed word list
- can you hyperlink the indexed word list?
- can you put the Case, Date and Witness name in the page header?
- highlight Questions and/or Answers
- can I attach my exhibits to the PDF rather than embed them?
After reviewing more of the requirements and studying how these documents are actually used, I added a few more requirements:
- Key-Word-in-Context index
You can check out the programs at:
Some comments on the process
I chose wxWidgets for cross-platform compatibility. And after much searching and evaluation, I chose libHaru for PDF generation.
It became obvious from the first unit tests that "standard ASCII format" was an oxymoron. While reading text files is rather straightforward, these files came in many flavors and required a state machine to read them. Some of the files are good enough to tell you the page numbers. Some want you to guess. Some put a whole bunch of numbers on a page and figure you'll know which are line numbers as opposed to page numbers as opposed to just a bunch of numbers.
Figuring out line numbers should be easy, because it's the first number of a line, right? But what do you do when you find text on an unnumbered line between line '3' and line '4'? Oh, it's a 'half line', is it? How did line '712' get between line '5' and line '6'? Oh, a street number on a 'half line', not a line number. What line number is '04:21:39'? Oh, a timestamp, right. Because transcripts all have timestamps on each line, unless they don't. Then there's this rule: Answers are preceded by an 'A:' while Questions are preceded by a 'Q:', unless they're not. Needless to say, it was a lot of fun learning about depositions' 'standard' format.
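The heuristics above can be sketched in code. This is a minimal illustration, not TranscriptWizard's actual parser; the function name, the timestamp pattern, and the assumption that a real line number stays below 26 are all mine:

```cpp
#include <cassert>
#include <cctype>
#include <regex>
#include <string>

// Possible interpretations of the leading token on a transcript line.
enum class TokenKind { LineNumber, Timestamp, Other };

// Classify a leading token: a timestamp looks like HH:MM:SS, a plausible
// line number is a small integer (transcript pages rarely run past 25
// lines), and everything else -- a street number on a 'half line', stray
// digits -- is treated as ordinary text.
TokenKind classifyLeadingToken(const std::string& token) {
    static const std::regex timestamp(R"(\d{1,2}:\d{2}:\d{2})");
    if (std::regex_match(token, timestamp))
        return TokenKind::Timestamp;

    bool allDigits = !token.empty();
    for (char c : token)
        if (!std::isdigit(static_cast<unsigned char>(c)))
            allDigits = false;

    if (allDigits) {
        int n = std::stoi(token);
        if (n >= 1 && n <= 25)          // plausible line number
            return TokenKind::LineNumber;
    }
    return TokenKind::Other;            // e.g. '712' on a half line
}
```

Even this toy version shows why a state machine is needed: the same token can mean different things depending on what the surrounding lines look like.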
I created a document class to store the input text. From this document class a word index could be created. Pretty straightforward: take all the words that are part of the testimony and stick them in a std::map. And the testimony is different from the rest of the document how? Turns out that knowing the parts of a document is easy for a human, not so easy for a machine. And don't forget that 'Obvious' and 'obvious' are the same word, one just starts a sentence. Unlike 'Providence', which is a proper noun, except for when it's not.
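The core of such an index really is a std::map. Here is a minimal sketch, with the type alias and function names invented for illustration; it folds case so 'Obvious' and 'obvious' land on one entry, and it sidesteps the genuinely hard parts (deciding what counts as testimony, recognizing proper nouns):

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <map>
#include <set>
#include <string>
#include <vector>

// Map each lower-cased word to the set of page numbers it appears on.
using WordIndex = std::map<std::string, std::set<int>>;

// Lower-case a word so sentence-initial capitals don't split entries.
std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return s;
}

// Record every word on the given page in the index.
void indexWords(WordIndex& index,
                const std::vector<std::string>& words, int page) {
    for (const auto& w : words)
        index[toLower(w)].insert(page);
}
```

Because std::map keeps its keys sorted, the finished index can be walked in alphabetical order when writing the word list to the PDF.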
Once a word index is constructed, it must be stripped of common words. Things people say too often that don't add meaning to their dialog. Ya know - like 'umm'. Yes, people actually say 'umm' during a deposition. And yes, the court reporter actually types it out. Also, don't forget to exclude 'object' when a lawyer says "I object!". But please do include it for things like "the object of my desire".
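Stripping the fillers from a finished index can be as simple as erasing entries against a stop list. The sketch below (hypothetical names, and a stop list of my own invention) shows the easy part; the context-sensitive cases the text mentions, like 'object' the verb versus 'object' the noun, cannot be handled by a flat list and would need context at indexing time:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Remove common filler words from a word index. Note the limitation:
// a flat stop list either keeps or drops a word everywhere, so it
// cannot tell "I object!" apart from "the object of my desire".
void stripStopWords(std::map<std::string, std::set<int>>& index) {
    static const std::set<std::string> stopWords = {
        "umm", "uh", "the", "a", "an", "and", "of", "to"
    };
    for (const auto& w : stopWords)
        index.erase(w);   // erase() is a no-op if the word is absent
}
```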
By now you get the point: this was a simple program full of things that people take for granted but a machine cannot. And it's all done in a language that sometimes resembles English.