Sentence Tokenization

Overview

Sentence tokenization is the process of splitting text into individual sentences. For literature, journalism, and formal documents, the tokenization algorithms built into spaCy perform well, since the tokenizer is trained on a corpus of formal English text. The sentence tokenizer performs less well on electronic health records, which feature abbreviations, medical terms, spatial measurements, and other forms not found in standard written English.

ClarityNLP attempts to improve the results of the sentence tokenizer for electronic health records. It does this by looking for the types of textual constructs that confuse the tokenizer and replacing them with single words. The sentence tokenizer will not split an individual word, so the offending text, in replacement form, is preserved intact during the tokenization process. After the individual sentences are generated, ClarityNLP makes the reverse substitutions, which restores the original text in a set of improved sentences. ClarityNLP also performs additional fixups of the sentences to further improve the results. This document describes the process and illustrates it with an example.

Source Code

The source code for the sentence tokenizer is located in nlp/algorithms/segmentation/segmentation.py, with supporting code in nlp/algorithms/segmentation/segmentation_helper.py.

Inputs

The entry point to the sentence tokenizer is the parse_sentences method of the Segmentation class. This method takes a single argument, the text string to be split into sentences.

Outputs

The parse_sentences method returns a list of strings, which are the individual sentences.

Example

seg_obj = Segmentation()
sentence_list = seg_obj.parse_sentences(my_text)

Algorithm

The improvement process proceeds through several stages, which are:

  1. Perform cleanup operations on the report text.
  2. Perform textual substitutions.
  3. Run the spaCy sentence tokenizer on the cleaned, substituted text.
  4. Find and split consecutive sentences that have no space after the first sentence's period.
  5. Undo the substitutions.
  6. Perform additional sentence fixups for some easily-detectable errors.
  7. Place all-caps section headers in their own sentence.
  8. Scan the resulting sentences and delete any remaining errors.

Additional explanations for some of these items are provided below.

Text Cleanup

The text cleanup process first searches the report text for cut-and-paste section headers found between (Over) and (Cont) tokens. These headers are often inserted directly into a sentence, producing a confusing result. Here is an example:

“There are two subcentimeter right renal hypodensities, 1 in\n (Over)\n\n [**2728-6-8**] 5:24 PM\n CT CHEST W/CONTRAST; CT ABD & PELVIS W & W/O CONTRAST, ADDL SECTIONSClip # [**Telephone/Fax (1) 103840**]\n Reason: Evaluate for metastasis/lymphadenopathy related to ? GI [**Country **]\n Admitting Diagnosis: UPPER GI BLEED\n Contrast: OMNIPAQUE Amt: 130\n ______________________________________________________________________________\n FINAL REPORT\n (Cont)\n the upper pole and 1 in the lower pole, both of which are too small to\n characterize.”

By looking at this text closely, you can see how the (Over)..(Cont) section has been pasted into this sentence:

“There are two subcentimeter right renal hypodensities, 1 in the upper pole and 1 in the lower pole, both of which are too small to\n characterize.”

The meaning of this passage is not obvious to a human observer on first inspection, and it completely confuses a sentence tokenizer trained on standard English text.

ClarityNLP finds these pasted report headers and removes them.
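
The removal step can be sketched with a single regular expression. This is a minimal illustration, not ClarityNLP's actual code: the pattern and the function name remove_pasted_headers are assumptions made for this example, and the sample text is a shortened version of the passage above.

```python
import re

# Hypothetical pattern: everything from "(Over)" through "(Cont)", plus the
# whitespace on either side. re.DOTALL lets "." match the embedded newlines.
PASTED_HEADER = re.compile(r"\s*\(Over\).*?\(Cont\)\s*", re.DOTALL)


def remove_pasted_headers(text: str) -> str:
    """Replace each (Over)...(Cont) block with a single space."""
    return PASTED_HEADER.sub(" ", text)


report = (
    "There are two subcentimeter right renal hypodensities, 1 in\n (Over)\n\n"
    " CT CHEST W/CONTRAST\n FINAL REPORT\n (Cont)\n"
    " the upper pole and 1 in the lower pole."
)
cleaned = remove_pasted_headers(report)
print(cleaned)
```

The non-greedy ".*?" keeps each match confined to one header block when a report contains several.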

The next step in the cleanup process is the identification of numbered lists. The numbers are removed and the narrative descriptions following the numbers are retained.
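
As a rough sketch of this step, the snippet below strips list numbers that appear at the start of a line. The assumed list-item form (digits, a period, then whitespace) and the name strip_list_numbers are illustrative choices, not taken from the ClarityNLP source.

```python
import re

# Assumed list-item form: optional indentation, digits, a period, whitespace.
# Requiring whitespace after the period leaves values such as "12.5 mg" alone.
LIST_NUMBER = re.compile(r"^\s*\d+\.\s+", re.MULTILINE)


def strip_list_numbers(text: str) -> str:
    """Remove leading list numbers and keep the narrative that follows."""
    return LIST_NUMBER.sub("", text)


history = "1.  Hypertension.\n2.  Hyperlipidemia.\n3.  Mild asthma."
print(strip_list_numbers(history))
```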

As is visible in the pasted section header example above, electronic health records often contain long runs of dashes, asterisks, or other symbols. These strings are used to delimit sections in the report, but they are of no use for machine interpretation, so ClarityNLP searches for and removes such strings.

Finally, ClarityNLP locates any instances of repeated whitespace (which includes spaces, newlines, and tabs) and replaces them with a single space.
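
The last two cleanup steps might look roughly like the following. The set of delimiter characters and the minimum run length of three are assumptions for this sketch; ClarityNLP's own patterns may differ.

```python
import re

# Assumed definition of a delimiter run: three or more of these symbols.
# A threshold of three leaves the "**" in anonymization tokens untouched.
SYMBOL_RUN = re.compile(r"[-_*=#~]{3,}")
WHITESPACE_RUN = re.compile(r"\s+")


def strip_delimiters_and_whitespace(text: str) -> str:
    """Remove delimiter runs, then collapse repeated whitespace to one space."""
    text = SYMBOL_RUN.sub(" ", text)
    return WHITESPACE_RUN.sub(" ", text)


sample = (
    "Contrast: OMNIPAQUE Amt: 130\n ______________\n"
    " FINAL REPORT\n\n\tNo acute findings."
)
result = strip_delimiters_and_whitespace(sample)
print(result)
```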

Textual Substitutions

ClarityNLP performs several different types of textual substitution prior to sentence tokenization. All of these constructs can potentially cause problems:

Construct           Example
Abbreviations       .H/O, Sust. Rel., w/
Vital Signs         VS T97.3 P84 BP120/56 RR16 O2Sat98 2LNC
Capitalized Header  INDICATION:
Anonymizations      [**2728-6-8**], [**Telephone/Fax (1) 103840**]
Contrast Agents     Contrast: OMNIPAQUE Amt: 130
Field of View       Field of view: 40
Size Measurement    3.1 x 4.2 mm
Dispensing Info     Protonix 40 mg p.o. q. day.
Gender              Sex: M

ClarityNLP uses regular expressions to find instances of these constructs. Wherever they occur they are replaced with single-word tokens such as “ANON000”, “ABBREV001”, “MEAS002”, etc. Replacements of each type are numbered sequentially. The sentence tokenizer sees these replacements as single words, and it preserves them unchanged through the tokenization process. These replacements can be easily searched for and replaced in the resulting sentences.
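
The substitute-then-restore idea can be sketched as follows. This is an illustration under stated assumptions: the simplified anonymization pattern, the helper names substitute and undo, and the regex-based stand-in for the sentence tokenizer are all invented for this example and are not ClarityNLP's implementation.

```python
import re

# Simplified, assumed pattern for anonymization tokens such as [**2728-6-8**].
ANON = re.compile(r"\[\*\*.*?\*\*\]")


def substitute(text, pattern, prefix):
    """Replace each match with a numbered single-word token.

    Returns the substituted text and a token -> original-text mapping.
    """
    replacements = {}

    def _swap(match):
        token = f"{prefix}{len(replacements):03d}"
        replacements[token] = match.group(0)
        return token

    return pattern.sub(_swap, text), replacements


def undo(sentences, replacements):
    """Restore the original text in each sentence."""
    restored = []
    for sentence in sentences:
        for token, original in replacements.items():
            sentence = sentence.replace(token, original)
        restored.append(sentence)
    return restored


text = "Seen on [**3104-4-26**]. Discharged on [**3104-4-28**]."
substituted, table = substitute(text, ANON, "ANON")

# Stand-in for the real sentence tokenizer: split after a period.
sentences = re.split(r"(?<=\.)\s+", substituted)
restored = undo(sentences, table)
print(substituted)
print(restored)
```

Because each token contains no spaces or punctuation, a tokenizer has no reason to break it, and the mapping makes the reverse substitution a plain string replacement.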

Split Consecutive Sentences

The punctuation in electronic health records does not always follow standard forms. Sometimes consecutive sentences in a report have a missing space after the period of the first sentence, which can cause the sentence tokenizer to treat both sentences together as a single run-on sentence. ClarityNLP detects these occurrences and separates the sentences. It also avoids separating valid abbreviations such as C.Diff., G.Jones, etc.
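
One possible heuristic for this step is shown below. It is an assumption made for illustration, not the rule ClarityNLP uses: split at a period that follows two lowercase letters and is immediately followed by a capitalized word. Abbreviations such as C.Diff. and G.Jones have a single capital letter before the period, so they are left alone. Splitting on a zero-width match requires Python 3.7 or later.

```python
import re

# Assumed heuristic: "...xx.Yy" marks a missing space between two sentences.
# The lookbehind and lookahead are zero-width, so no characters are consumed.
MISSING_SPACE = re.compile(r"(?<=[a-z]{2}\.)(?=[A-Z][a-z])")


def split_run_on(sentence: str) -> list:
    """Split a run-on sentence where the space after a period is missing."""
    return MISSING_SPACE.split(sentence)


print(split_run_on("The patient remained stable.He was discharged home."))
print(split_run_on("Stool was positive for C.Diff. toxin."))
print(split_run_on("Seen by G.Jones in clinic."))
```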

Perform Additional Sentence Fixups

Sometimes the sentence tokenizer generates sentences that begin with a punctuation character such as : or ,. ClarityNLP looks for such occurrences and moves the punctuation to the end of the preceding sentence.
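
A minimal sketch of this fixup, assuming only a single leading colon or comma needs to be moved; the function name fix_leading_punctuation is hypothetical.

```python
def fix_leading_punctuation(sentences):
    """Move a leading ':' or ',' to the end of the preceding sentence."""
    fixed = []
    for sentence in sentences:
        stripped = sentence.lstrip()
        if fixed and stripped[:1] in {":", ","}:
            # Attach the punctuation to the previous sentence.
            fixed[-1] += stripped[0]
            remainder = stripped[1:].lstrip()
            if remainder:
                fixed.append(remainder)
        else:
            fixed.append(sentence)
    return fixed


print(fix_leading_punctuation(["PAST MEDICAL HISTORY", ": Hypertension."]))
```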

Delete Remaining Errors

ClarityNLP scans the resulting set of sentences and takes these actions:

  • deletes any remaining list numbering
  • deletes any sentences consisting only of list numbering
  • removes any sentences that consist only of ‘#1’, ‘#2’, etc.
  • removes any sentences consisting entirely of nonalphanumeric symbols
  • concatenates sentences that incorrectly split an age in years
  • concatenates sentences that split the subject of a measurement from the measurement
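
The deletion rules in this list can be approximated with two patterns, as sketched below. The patterns and the name drop_residue are assumptions for illustration; the two concatenation rules are not covered here.

```python
import re

# Assumed forms of leftover list numbering: "2.", "3", "#1".
LIST_ONLY = re.compile(r"^\s*#?\d+\.?\s*$")
# A sentence with no letters or digits at all, e.g. "*" or "----".
SYMBOLS_ONLY = re.compile(r"^[\W_]+$")


def drop_residue(sentences):
    """Discard sentences that are only list numbering or only symbols."""
    return [
        s for s in sentences
        if not LIST_ONLY.match(s) and not SYMBOLS_ONLY.match(s)
    ]


raw = ["Hypertension.", "2.", "#1", "*", "----", "Mild asthma."]
print(drop_residue(raw))
```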

Example

Here is a before and after example illustrating several of the tokenization problems discussed above. The data is taken from one of the reports in the MIMIC data set.

BEFORE: Each numbered string below is a sentence that emerges from the sentence tokenizer without ClarityNLP’s additional processing. Note that the anonymized date and name tokens [** ... **] are broken apart, as are numbered lists, drug dispensing information, vital signs, etc. You can see how the sentence tokenizer performs better for the narrative sections, but the abbreviations and other nonstandard forms confuse it and cause errors:

[  0]	Admission Date:  [
[  1]	**3104-4-26
[  2]	**]     Discharge Date:  [**3104-4-28
[  3]	**]


Service:  CARDIAC CA

CHIEF COMPLAINT:   Dyspnea on exertion.

HISTORY OF PRESENT ILLNESS:
[  4]	This is a 78 year old male with
hypertension and hyperlipidemia who was in his usual state of health until two weeks prior to admission when he noted increasing shortness of breath on exertion, especially with stairs.
[  5]	Since that time, the patient reports decreased exercise tolerance but denied any orthopnea, paroxysmal nocturnal dyspnea, or lower extremity swelling.
[  6]	He denies any dizziness or lightheadedness.
[  7]	He was seen in Dr.
[  8]	[**Last Name (STitle) 23973*
[  9]	*]
[ 10]	[**Name (STitle) 23974
[ 11]	*
[ 12]	*]
[ 13]	Clinic the day of admission and was found to have
high grade infra-nodal heart block and was sent to the Emergency Room.
[ 14]	A central line was placed with temporary
pacing wire placed overnight.
[ 15]	PAST MEDICAL HISTORY:
1.  Hypertension.
[ 16]	2.
[ 17]	Hyperlipidemia.
[ 18]	3.
[ 19]	Exercise thallium stress test in [**3100*
[ 20]	*] showed a small
basal inferior fixed defect.
[ 21]	4.
[ 22]	Mild asthma.
[ 23]	5.
[ 24]	Hemorrhoids.
[ 25]	6.
[ 26]	Colonic polyps.
[ 27]	7.
[ 28]	Left bundle branch block since [
[ 29]	**3098-10-8**].
8.
[ 30]	Bilateral hernia repairs.
[ 31]	ALLERGIES:
[ 32]	He has no known drug allergies.
[ 33]	MEDICATIONS:
[ 34]	1.  Hydrochlorothiazide 12.5 mg
[ 35]	p.o.
[ 36]	q. day.
[ 37]	2.
[ 38]	Lipitor 40 mg
[ 39]	p.o.
[ 40]	q.
[ 41]	h.s.
[ 42]	3.
[ 43]	Enalapril 20
[ 44]	mg p.o. twice a day.
[ 45]	4.
[ 46]	Cardizem 180 mg p.o.
[ 47]	q. day.
[ 48]	5.
[ 49]	Aspirin 81 mg
[ 50]	p.o.
[ 51]	q. day.
[ 52]	SOCIAL HISTORY:
[ 53]	He has a remote tobacco history; quit over
25 years ago.
[ 54]	He has a remote alcohol history; quit over 17
years ago.
[ 55]	FAMILY HISTORY:   Family history of stroke but denies any
family history of coronary artery disease or malignancy.
[ 56]	PHYSICAL EXAMINATION:   Temperature
[ 57]	is 98.0 F.; heart rate 35
to 45; blood pressure 161/32; respiratory rate 19; 98% on room air.
[ 58]	In no acute distress.
[ 59]	Pupils were reactive to light; the left was 3 millimeters to 2 millimeters; on the right it was 2 millimeters to 1 millimeters.
[ 60]	Extraocular movements intact.
[ 61]	Mucous membranes were moist.
[ 62]	Jugular venous pressure at about 7 centimeters.
[ 63]	Lungs were clear to auscultation bilaterally.
[ 64]	He is bradycardic with normal S1 and S2 with I/VI systolic murmur at the apex.
[ 65]	His abdomen was soft, nontender, nondistended, with normoactive bowel sounds.
[ 66]	No edema.
[ 67]	In his extremities he had two plus dorsalis pedis bilaterally.
[ 68]	LABORATORY:
[ 69]	EKG showed sinus with atrial rate of 70, 2:1
heart block with ventricular rate of 35 and an old left bundle branch block.
[ 70]	White blood cell count 11.3, hematocrit 34.6, platelets 298.
[ 71]	Sodium 140, potassium 4.1, chloride 102, bicarbonate 25, BUN
26, creatinine 1.3, glucose 129.
[ 72]	CK 96.
[ 73]	Troponin less than
0.3.

Echocardiogram in [
[ 74]	**3103-2-6
[ 75]	**] showed a large left atrium,
ejection fraction 60 to 65% with mild symmetric left ventricular hypertrophy, trace aortic regurgitation, mild mitral regurgitation.
[ 76]	INR was 1.2, PTT 22.7.
[ 77]	Total cholesterol in [**3104-2-6**]
showed total cholesterol of 161, LDL 89, HDL of 35, triglycerides of 184.
[ 78]	Urinalysis was negative.
[ 79]	Chest x-ray was negative.
[ 80]	HOSPITAL COURSE:
[ 81]	The patient remained stable in the
hospital.
[ 82]	He underwent electrophysiology study and pacemaker placement.
[ 83]	He remained stable and asymptomatic.
[ 84]	He was then discharged home.
[ 85]	DISCHARGE
[ 86]	INSTRUCTIONS:
[ 87]	1.
[ 88]	Not to lift anything heavier than ten pounds for two
weeks with the left arm.
[ 89]	2.
[ 90]	He was asked to call his cardiologist with any fatigue or
shortness of breath.
[ 91]	3.
[ 92]	He was to follow-up in Device Clinic in one week.
[ 93]	4.
[ 94]	He was to follow-up with his cardiologist in two to three
weeks.
[ 95]	DISCHARGE DIAGNOSES:
[ 96]	1.
[ 97]	Complete heart block.
[ 98]	MAJOR
[ 99]	INTERVENTIONS:
[100]	1.
[101]	Transvenous pacer wire placement on [**4-26
[102]	**].
[103]	2.
[104]	Pacemaker placement on [
[105]	**4-27
[106]	**].
[107]	CONDITION ON DISCHARGE:   Stable.

DISCHARGE
[108]	MEDICATIONS:
[109]	1.
[110]	Enalapril 20
[111]	mg p.o. twice a day.
[112]	2.
[113]	Hydrochlorothiazide 12.5 mg p.o.
[114]	q. day.
[115]	3.
[116]	Lipitor 40 mg
[117]	p.o.
[118]	q.
[119]	h.s.
[120]	4.
[121]	Percocet p.r.n.
5.
[122]	Keflex 500 mg p.o.
[123]	q. six hours for three days.
[124]	6.
[125]	Ativan 1 mg p.o.
[126]	q.
[127]	h.s.
[128]	as needed.
[129]	7.
[130]	Diltiazem 180 mg p.o.
[131]	q. day.
[132]	[**First Name8 (NamePattern2)
[133]	*
[134]	*]
[135]	[
[136]	**First Name8 (NamePattern2) 1682
[137]	*
[138]	*]
[139]	[**Name8 (MD)
[140]	*
[141]	*], M.D.  [**MD Number(1) 1683
[142]	**]

Dictated By:[**Name8 (MD) 5378
[143]	**]

MEDQUIST36

D:
[144]	[**3104-4-29
[145]	**]  11:19
T:  [
[146]	*
[147]	*3104-5-2**]  21:56
JOB#:  [
[148]	**Job Number 23975**]

AFTER: Here is the same report after ClarityNLP does the cleanup, substitutions, and additional processing described above:

[  0]	Admission Date: [**3104-4-26**] Discharge Date: [**3104-4-28**] Service:
[  1]	CARDIAC CA CHIEF COMPLAINT:
[  2]	Dyspnea on exertion.
[  3]	HISTORY OF PRESENT ILLNESS:
[  4]	This is a 78 year old male with hypertension and hyperlipidemia who was in his usual state of health until two weeks prior to admission when he noted increasing shortness of breath on exertion, especially with stairs.
[  5]	Since that time, the patient reports decreased exercise tolerance but denied any orthopnea, paroxysmal nocturnal dyspnea, or lower extremity swelling.
[  6]	He denies any dizziness or lightheadedness.
[  7]	He was seen in Dr. [**Last Name (STitle) 23973**] [**Name (STitle) 23974**] Clinic the day of admission and was found to have high grade infra-nodal heart block and was sent to the Emergency Room.
[  8]	A central line was placed with temporary pacing wire placed overnight.
[  9]	PAST MEDICAL HISTORY:
[ 10]	Hypertension.
[ 11]	Hyperlipidemia.
[ 12]	Exercise thallium stress test in [**3100**] showed a small basal inferior fixed defect.
[ 13]	Mild asthma.
[ 14]	Hemorrhoids.
[ 15]	Colonic polyps.
[ 16]	Left bundle branch block since [**3098-10-8**].
[ 17]	Bilateral hernia repairs.
[ 18]	ALLERGIES:
[ 19]	He has no known drug allergies.
[ 20]	MEDICATIONS:
[ 21]	Hydrochlorothiazide 12.5 mg p.o. q. day.
[ 22]	Lipitor 40 mg p.o. q. h.s.
[ 23]	Enalapril 20 mg p.o. twice a day.
[ 24]	Cardizem 180 mg p.o. q. day.
[ 25]	Aspirin 81 mg p.o. q. day.
[ 26]	SOCIAL HISTORY:
[ 27]	He has a remote tobacco history; quit over 25 years ago.
[ 28]	He has a remote alcohol history; quit over 17 years ago.
[ 29]	FAMILY HISTORY:
[ 30]	Family history of stroke but denies any family history of coronary artery disease or malignancy.
[ 31]	PHYSICAL EXAMINATION:
[ 32]	Temperature is 98.0 F.; heart rate 35 to 45; blood pressure 161/32; respiratory rate 19; 98% on room air.
[ 33]	In no acute distress.
[ 34]	Pupils were reactive to light; the left was 3 millimeters to 2 millimeters; on the right it was 2 millimeters to 1 millimeters.
[ 35]	Extraocular movements intact.
[ 36]	Mucous membranes were moist.
[ 37]	Jugular venous pressure at about 7 centimeters.
[ 38]	Lungs were clear to auscultation bilaterally.
[ 39]	He is bradycardic with normal S1 and S2 with I/VI systolic murmur at the apex.
[ 40]	His abdomen was soft, nontender, nondistended, with normoactive bowel sounds.
[ 41]	No edema.
[ 42]	In his extremities he had two plus dorsalis pedis bilaterally.
[ 43]	LABORATORY:
[ 44]	EKG showed sinus with atrial rate of 70, 2:1 heart block with ventricular rate of 35 and an old left bundle branch block.
[ 45]	White blood cell count 11.3, hematocrit 34.6, platelets Sodium 140, potassium 4.1, chloride 102, bicarbonate 25, BUN 26, creatinine 1.3, glucose 129.
[ 46]	CK Troponin less than 0.
[ 47]	Echocardiogram in [**3103-2-6**] showed a large left atrium, ejection fraction 60 to 65% with mild symmetric left ventricular hypertrophy, trace aortic regurgitation, mild mitral regurgitation.
[ 48]	INR was 1.2, PTT 22.
[ 49]	Total cholesterol in [**3104-2-6**] showed total cholesterol of 161, LDL 89, HDL of 35, triglycerides of Urinalysis was negative.
[ 50]	Chest x-ray was negative.
[ 51]	HOSPITAL COURSE:
[ 52]	The patient remained stable in the hospital.
[ 53]	He underwent electrophysiology study and pacemaker placement.
[ 54]	He remained stable and asymptomatic.
[ 55]	He was then discharged home.
[ 56]	DISCHARGE INSTRUCTIONS:
[ 57]	Not to lift anything heavier than ten pounds for two weeks with the left arm.
[ 58]	He was asked to call his cardiologist with any fatigue or shortness of breath.
[ 59]	He was to follow-up in Device Clinic in one week.
[ 60]	He was to follow-up with his cardiologist in two to three weeks.
[ 61]	DISCHARGE DIAGNOSES:
[ 62]	Complete heart block.
[ 63]	MAJOR INTERVENTIONS:
[ 64]	Transvenous pacer wire placement on [**4-26**].
[ 65]	Pacemaker placement on [**4-27**].
[ 66]	CONDITION ON DISCHARGE:
[ 67]	Stable.
[ 68]	DISCHARGE MEDICATIONS:
[ 69]	Enalapril 20 mg p.o. twice a day.
[ 70]	Hydrochlorothiazide 12.5 mg p.o. q. day.
[ 71]	Lipitor 40 mg p.o. q. h.s.
[ 72]	Percocet p.r.n.
[ 73]	Keflex 500 mg p.o. q. six hours for three days.
[ 74]	Ativan 1 mg p.o. q. h.s. as needed.
[ 75]	Diltiazem 180 mg p.o. q. day.
[ 76]	[**First Name8 (NamePattern2) **]
[ 77]	[**First Name8 (NamePattern2) 1682**]
[ 78]	[**Name8 (MD) **], M.D.
[ 79]	[**MD Number(1) 1683**] Dictated By:[**Name8 (MD) 5378**] MEDQUIST36
[ 80]	D:
[ 81]	[**3104-4-29**] 11:19
[ 82]	T:
[ 83]	[**3104-5-2**]
[ 84]	21:56
[ 85]	JOB#:
[ 86]	[**Job Number 23975**]

Note that there are fewer sentences overall, and that each sentence has a much more standard form than those in the ‘before’ panel above. The drug dispensing instructions have been corrected, the list numbering has been removed, and the patient temperature that was split across sentences 56 and 57 has been restored (new sentence 32).

Command Line Interface

The sentence tokenizer has a command line interface that can be used for inspecting the generated sentences. The input data must be a JSON-formatted file with the proper ClarityNLP fields. This file can be produced by querying Solr for the reports of interest and dumping the results as a JSON-formatted file. The sentence tokenization module will read the input file, split the text into sentences as described above, and write the results to stdout. Help for the command line interface can be obtained by running this command from the nlp/algorithms/segmentation folder:

python3 ./segmentation.py --help

Some examples:

To tokenize all reports in myreports.json and print each sentence to stdout:

python3 ./segmentation.py --file /path/to/myreports.json

To tokenize only the first 10 reports (indices begin with 0):

python3 ./segmentation.py --file myreports.json --end 9

To tokenize reports 115 through 134 inclusive, and to also show the report text after cleanup and token substitution (i.e. the actual input to the spaCy sentence tokenizer):

python3 ./segmentation.py --file myreports.json --start 115 --end 134 --debug