Finding Time Expressions

Overview

ClarityNLP includes a module that locates time expressions in clinical text. By ‘time expression’ we mean a string such as 9:41 AM, 05:12:24.12345, or something similar. The TimeFinder module scans sentences for time expressions, extracts them, and generates output in JSON format.

Source Code

The source code for the time finder module is located in nlp/algorithms/finder/time_finder.py.

Inputs

A single string, the sentence to be scanned for time expressions.

Outputs

A JSON array containing these fields for each time expression found:

Field Name          Explanation
text                string, text of the complete time expression
start               integer, offset of the first character in the matching text
end                 integer, offset of the final character in the matching text plus 1
hours               integer hours
minutes             integer minutes
seconds             integer seconds
fractional_seconds  string, contains digits after decimal point, including any leading zeros
am_pm               string, either STR_AM or STR_PM (see values below)
timezone            string, timezone code
gmt_delta_sign      sign of the UTC offset, either ‘+’ or ‘-’
gmt_delta_hours     integer, UTC hour offset
gmt_delta_minutes   integer, UTC minute offset

All JSON results contain an identical number of fields. Any fields that are not valid for a given time expression will have a value of EMPTY_FIELD and should be ignored.
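
As an illustration of how a caller might consume this output, the sketch below builds a hand-written sample result and drops the fields that hold the placeholder value. The values assigned to EMPTY_FIELD and STR_AM here are hypothetical stand-ins, since the real values are defined in the ClarityNLP source and are not given in this document. The helper valid_fields is likewise an illustrative name, not part of the module.

```python
import json

# Hypothetical placeholders: the real values of EMPTY_FIELD and STR_AM are
# defined in the ClarityNLP source and are not given in this document.
EMPTY_FIELD = None
STR_AM = "am"

def valid_fields(result, empty=EMPTY_FIELD):
    """Return only the fields of one result whose value is not the placeholder."""
    return {name: value for name, value in result.items() if value != empty}

sentence = "The visit was at 9:41 AM today."
text = "9:41 AM"
start = sentence.index(text)

# Hand-built sample shaped like the field table above (not real module output).
sample_json = json.dumps([{
    "text": text, "start": start, "end": start + len(text),
    "hours": 9, "minutes": 41, "seconds": EMPTY_FIELD,
    "fractional_seconds": EMPTY_FIELD, "am_pm": STR_AM,
    "timezone": EMPTY_FIELD, "gmt_delta_sign": EMPTY_FIELD,
    "gmt_delta_hours": EMPTY_FIELD, "gmt_delta_minutes": EMPTY_FIELD,
}])

for result in json.loads(sample_json):
    cleaned = valid_fields(result)
    # 'end' is one past the final character, so it works directly as a slice bound.
    assert sentence[cleaned["start"]:cleaned["end"]] == cleaned["text"]
    print(cleaned)
```

Because end is the offset of the final character plus 1, the pair (start, end) can be used as a Python slice without adjustment.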

Algorithm

ClarityNLP uses a set of regular expressions to recognize time expressions. The time_finder module scans a sentence with each time-finding regex and keeps track of any matches. If any matches overlap, an overlap resolution process selects a winner. Each winning match is converted to a TimeValue namedtuple. This object is defined at the top of the source code module and can be imported by other Python code. Each namedtuple is appended to a list as the sentence is scanned. After scanning completes, the list of TimeValue namedtuples is converted to JSON and returned to the caller.
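
The pipeline described above can be sketched in a few lines. This is a simplified illustration, not the module's actual code: the two regexes are toy stand-ins for the real set, the TimeValue here carries only a subset of the fields, and the "longest match wins" overlap rule is an assumption, since the document does not say how the winner is chosen.

```python
import json
import re
from collections import namedtuple

# Simplified stand-in for the module's TimeValue namedtuple (fields abridged).
TimeValue = namedtuple("TimeValue", "text start end hours minutes seconds")

# Two toy regexes; the real module uses a larger set.
REGEXES = [
    re.compile(r"\b(?P<h>\d{1,2}):(?P<m>\d{2}):(?P<s>\d{2})\b"),
    re.compile(r"\b(?P<h>\d{1,2}):(?P<m>\d{2})\b"),
]

def find_times(sentence):
    # 1. Scan with every regex and record all candidate matches.
    candidates = [m for rx in REGEXES for m in rx.finditer(sentence)]

    # 2. Resolve overlaps. Assumed rule: the longest match wins.
    candidates.sort(key=lambda m: (-(m.end() - m.start()), m.start()))
    winners = []
    for m in candidates:
        if all(m.end() <= w.start() or m.start() >= w.end() for w in winners):
            winners.append(m)
    winners.sort(key=lambda m: m.start())

    # 3. Convert each winner to a namedtuple and collect them in a list.
    values = []
    for m in winners:
        groups = m.groupdict()
        sec = groups.get("s")
        values.append(TimeValue(
            text=m.group(), start=m.start(), end=m.end(),
            hours=int(groups["h"]), minutes=int(groups["m"]),
            seconds=int(sec) if sec is not None else None,
        ))

    # 4. Serialize the list of namedtuples to JSON.
    return json.dumps([v._asdict() for v in values])

print(find_times("Seen at 10:14:03 and again at 4:08."))
```

In the sample sentence both regexes match inside 10:14:03 (the second one matches 10:14), so the overlap step is what keeps the full hours-minutes-seconds form and discards the shorter candidate.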

Time Expression Formats

Using notation similar to that used by the PHP time reference, as well as the Wikipedia article on ISO 8601 formats, we define the following quantities:

Shorthand  Meaning
h          hour digit, 0-9
h12        12 hr. clock, hours only, 1-12, optional leading zero
h24        24 hr. clock, hours only, zero-padded, 00-24
m          minutes digit, 0-9
mm         minutes, zero-padded, 00-59
ss         seconds, zero-padded, 00-60 (60 means leap second)
am_pm      am or pm designator, either lower or upper case, with each letter optionally followed by a . symbol
t          either t or T
f          fractional seconds digit
?          the preceding item is optional
utc_time   hh, hh:mm, hhmm, hh:mm:ss, hhmmss, hh:mm:ss.ffffff, hhmmss.ffffff

With these definitions, the time expression formats that ClarityNLP recognizes are:

Time Expression Format               Examples
utc_timeZ                            10:14:03Z
utc_time+-hh:mm                      10:14:03+01:30, 10:14:03-01:30
utc_time+-hhmm                       10:14:03+0130, 10:14:03-0130
utc_time+-hh                         10:14:03+01, 10:14:03-01
YYYY-MM-DDTHH:MM:SS(.ffffff)?        1969-07-20T10:14:03.123456 (here MM:SS means minutes and seconds)
h12 am_pm                            4 am, 5PM, 10a.m., 9 pm.
h12m am_pm                           5:09 am, 9:41 P.M., 10:02 AM.
h12ms am_pm                          06:10:37 am, 10:19:36P.M., 1:02:03AM
h12msf am_pm                         7:11:39:012345 am, 11:41:22.22334 p.m.
h12m                                 4:08, 10:14, 11:59
t?h24m                               14:12, 01:27, 10:27, T23:43
t?h24ms                              01:03:24, T14:15:16
t?h24msf                             04:08:37.81412, 19:20:21.532453, 08:11:40:123456
t?hhmm                               0613, t0613
t?hhmmss                             232120, 120000
t?h24ms with timezone abbreviation   040837CEST, 112345 PST, T093000 Z
t?h24ms with GMT offset              T192021-0700, 14:45:15+03:30

ClarityNLP supports a published list of world time zone abbreviations, as well as Z, meaning “Zulu” or UTC time.
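
To make the notation concrete, here is a minimal sketch of how two of these formats could be written as Python regular expressions. These patterns are illustrative only and are not the regexes used in time_finder.py. The names AM_PM, H12M_AM_PM, UTC_OFFSET and parse_offset are hypothetical, the hour range 1-12 for h12 is an assumption, and reporting Z as a zero offset is a design choice made for this example.

```python
import re

# am_pm: "am" or "pm", either case, each letter optionally followed by "."
AM_PM = r"(?P<am_pm>[aApP]\.?[mM]\.?)"

# "h12m am_pm" format, e.g. "5:09 am", "9:41 P.M."
# Assumption: h12 covers 1-12 with an optional leading zero.
H12M_AM_PM = re.compile(
    r"\b(?P<hours>0?[1-9]|1[0-2]):(?P<minutes>[0-5]\d)\s*" + AM_PM
)

# hh:mm:ss followed by Z, +-hh, +-hhmm, or +-hh:mm
UTC_OFFSET = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2})"
    r"(?:(?P<zulu>Z)|(?P<sign>[+-])(?P<dh>\d{2})(?::?(?P<dm>\d{2}))?)"
)

def parse_offset(text):
    """Return the offset fields for a string such as '10:14:03+01:30', else None."""
    m = UTC_OFFSET.fullmatch(text)
    if m is None:
        return None
    if m.group("zulu"):
        # Design choice for this sketch: report Z as a zero offset.
        sign, hours, minutes = "+", 0, 0
    else:
        sign = m.group("sign")
        hours = int(m.group("dh"))
        minutes = int(m.group("dm") or 0)  # "+hh" form has no minutes
    return {
        "gmt_delta_sign": sign,
        "gmt_delta_hours": hours,
        "gmt_delta_minutes": minutes,
    }

print(parse_offset("10:14:03-0130"))
match = H12M_AM_PM.search("Dose given at 9:41 P.M. yesterday")
print(match.group("hours"), match.group("minutes"), match.group("am_pm"))
```

Note that the am_pm pattern consumes a trailing period, which is consistent with examples such as 10:02 AM. in the table, where the period is part of the matched text.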