NLPQL Expression Evaluation¶
Overview¶
In this section we describe the mechanisms that ClarityNLP uses to evaluate
NLPQL expressions. NLPQL expressions are found in define
statements such as:
define hasFever:
where Temperature.value >= 100.4;
define hasSymptoms:
where hasFever AND (hasDyspnea OR hasTachycardia);
The expressions in each statement consist of everything between the where
keyword and the semicolon:
Temperature.value >= 100.4
hasFever AND (hasDyspnea OR hasTachycardia)
NLPQL expressions can either be mathematical or logical in nature, as these examples illustrate.
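As a rough illustration of what "everything between the where keyword and the semicolon" means, the sketch below pulls expression text out of define statements with a regular expression. The helper name extract_expressions and the regex are assumptions made for this example; ClarityNLP itself parses NLPQL with its own front end, not with this code.

```python
import re

# Hypothetical helper, not part of ClarityNLP: capture the label and the
# text between 'where' and ';' for each define statement.
DEFINE_RE = re.compile(r"define\s+(\w+)\s*:\s*where\s+(.*?);", re.DOTALL)

def extract_expressions(nlpql_text):
    """Return a dict mapping each define label to its expression text."""
    return {label: " ".join(expr.split())
            for label, expr in DEFINE_RE.findall(nlpql_text)}

nlpql = """
define hasFever:
    where Temperature.value >= 100.4;

define hasSymptoms:
    where hasFever AND (hasDyspnea OR hasTachycardia);
"""

expressions = extract_expressions(nlpql)
for label, expr in expressions.items():
    print(label, "->", expr)
```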
Recall that the processing stages for a ClarityNLP job proceed roughly as follows:
- Parse the NLPQL file and determine which NLP tasks to run.
- Formulate a Solr query to find relevant source documents, partition the source documents into batches, and assign batches to computational tasks.
- Run the tasks in parallel and write individual task results to MongoDB. Each individual result from an NLP task comprises a task result document in the Mongo database. The term document is used here in the MongoDB sense, meaning an object containing key-value pairs. The MongoDB ‘documents’ should not be confused with the Solr source documents, which are electronic health records.
- Evaluate NLPQL expressions using the task result documents as the source data. Write expression evaluation results to MongoDB as separate result documents.
Thus ClarityNLP evaluates expressions after all tasks have finished running and have written their individual results to MongoDB. The expression evaluator consumes the task results inside MongoDB and uses them to generate new results from the expression statements.
We now turn our attention to a description of how the expression evaluator works.
The expression evaluator is built upon the MongoDB aggregation framework. Why use MongoDB aggregation to evaluate NLPQL expressions? The basic reason is that ClarityNLP writes results from each run to a MongoDB collection, and it is more efficient to evaluate expressions using MongoDB facilities than to use something else. Use of a non-Mongo evaluator would require ClarityNLP to:
- Run a set of queries to extract the data from MongoDB
- Transmit the query results across a network (if the Mongo instance is hosted remotely)
- Ingest the query results into another evaluation engine
- Evaluate the NLPQL expressions and generate results
- Transmit the results back to the Mongo host (if the Mongo instance is hosted remotely)
- Insert the results into MongoDB.
Evaluation via the MongoDB aggregation framework is more efficient than this process, since all data resides inside MongoDB.
NLPQL Expression Types¶
In the descriptions below we refer to NLPQL variables, which have the
form nlpql_feature.field_name. The NLPQL feature is a label introduced in a
define statement. The field_name is the name of an output field
generated by the task associated with the NLPQL feature.
The output field names from ClarityNLP tasks can be found in the NLPQL Reference.
1. Simple Mathematical Expressions¶
A simple mathematical expression is a string containing NLPQL variables, operators, parentheses, or numeric literals. Some examples:
Temperature.value >= 100.4
(Meas.dimension_X > 5) AND (Meas.dimension_X < 20)
(0 == Temperature.value % 20) OR (1 == Temperature.value % 20)
The variables in a simple mathematical expression all refer to a single NLPQL feature.
Simple mathematical expressions produce a result from data contained in a single task result document. The result of the expression evaluation is written to a new MongoDB result document.
2. Simple Logic Expressions¶
A simple logic expression is a string containing NLPQL features,
parentheses, and the logic operators AND, OR, and NOT.
For instance:
hasRigors OR hasDyspnea
hasFever AND (hasDyspnea OR hasTachycardia)
(hasShock OR hasDyspnea) AND (hasTachycardia OR hasNausea)
(hasFever AND hasNausea) NOT (hasRigors OR hasDyspnea)
Logic expressions operate on high-level NLPQL features, not on numeric literals or NLPQL variables. The presence of a numeric literal or NLPQL variable indicates that the expression is either a mathematical expression or possibly invalid.
Simple logic expressions produce a result from data contained in one or more task result documents. In other words, logic expressions operate on sets of result documents. The result from the logical expression evaluation is written to one or more new MongoDB result documents (the details will be explained below).
The NOT operator requires additional commentary. ClarityNLP supports the
use of NOT as a synonym for “set difference”. Thus A NOT B means
all elements of set A that are NOT also elements of set B. The use of
NOT to mean “set complement” is not supported. Hence expressions such as
NOT A, NOT hasRigors, etc., are invalid NLPQL statements. The NOT
operator must appear between two other expressions.
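The set-difference meaning of NOT can be illustrated with ordinary Python sets. The patient IDs below are invented for the example and do not come from any ClarityNLP run.

```python
# Hypothetical sets of patient IDs for two NLPQL features.
has_fever = {101, 102, 103, 104}
has_rigors = {103, 104, 105}

# 'hasFever NOT hasRigors' is a set difference: every patient in
# has_fever that is not also in has_rigors.
fever_not_rigors = has_fever - has_rigors
print(sorted(fever_not_rigors))  # [101, 102]
```

A set complement such as NOT hasRigors would require a universe of all patients to subtract from, which is one way to see why a lone NOT needs a left-hand operand.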
3. Mixed Expressions¶
A mixed expression is a string containing either:
- A mathematical expression and a logic expression
- A mathematical expression using variables involving two or more NLPQL features
For instance:
// both math and logic
(Temperature.value >= 100.4) AND (hasDyspnea OR hasTachycardia)
// two NLPQL features: LesionMeasurement and Temperature
(LesionMeasurement.dimension_X >= 10) OR (Temperature.value >= 100.4)
// math, logic, and multiple NLPQL features
Temperature.value >= 100.4 AND (hasRigors OR hasNausea) AND (LesionMeasurement.dimension_X >= 15)
The evaluation mechanisms used for mathematical, logic, and mixed expressions are quite different. To fully understand the issues involved, it is helpful to first understand the meaning of the ‘intermediate’ and ‘final’ phenotype results.
Phenotype Result CSV Files¶
Upon submission of a new job, ClarityNLP prints information to stdout that looks similar to this:
HTTP/1.0 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 1024
Access-Control-Allow-Origin: *
Server: Werkzeug/0.14.1 Python/3.6.4
Date: Fri, 23 Nov 2018 18:40:38 GMT
{
"job_id": "11108",
"phenotype_id": "11020",
"phenotype_config": "http://localhost:5000/phenotype_id/11020",
"pipeline_ids": [
12529,
12530,
12531,
12532,
12533,
12534,
12535
],
"pipeline_configs": [
"http://localhost:5000/pipeline_id/12529",
"http://localhost:5000/pipeline_id/12530",
"http://localhost:5000/pipeline_id/12531",
"http://localhost:5000/pipeline_id/12532",
"http://localhost:5000/pipeline_id/12533",
"http://localhost:5000/pipeline_id/12534",
"http://localhost:5000/pipeline_id/12535"
],
"status_endpoint": "http://localhost:5000/status/11108",
"results_viewer": "?job=11108",
"luigi_task_monitoring": "http://localhost:8082/static/visualiser/index.html#search__search=job=11108",
"intermediate_results_csv": "http://localhost:5000/job_results/11108/phenotype_intermediate",
"main_results_csv": "http://localhost:5000/job_results/11108/phenotype"
}
Here we see various items relevant to the job submission. Each submission
receives a job_id, which is a unique numerical identifier for the run.
ClarityNLP writes all task results from all jobs to the phenotype_results
collection in a Mongo database named nlp. The job_id is
needed to distinguish the data belonging to each run. Results can be extracted
directly from the database by issuing MongoDB queries.
We also see URLs for ‘intermediate’ and ‘main’ phenotype results. These are
convenience APIs that export the results to CSV files. The intermediate
result CSV file contains the output from each NLPQL task not marked as
final. The main result CSV file contains the results from any final tasks or
final expression evaluations. Either CSV file can be viewed in Excel or in
another spreadsheet application.
Each NLP task generates a result document distinguished by a particular value
of the nlpql_feature field. The define statement
define hasFever:
where Temperature.value >= 100.4;
generates a set of rows in the intermediate CSV file with the
nlpql_feature field set to hasFever. The NLP tasks
// nlpql_feature 'hasRigors'
define hasRigors:
Clarity.ProviderAssertion({
termset: [RigorsTerms],
documentset: [ProviderNotes]
});
// nlpql_feature 'hasDyspnea'
define hasDyspnea:
Clarity.ProviderAssertion({
termset: [DyspneaTerms],
documentset: [ProviderNotes]
});
generate two blocks of rows in the CSV file, the first block having the
nlpql_feature field set to hasRigors and the next block having it
set to hasDyspnea. The different nlpql_feature blocks appear in the order
in which they are listed in the source NLPQL file. The presence of these
nlpql_feature blocks makes locating the results of each NLP task a relatively
simple matter.
Expression Evaluation Algorithms¶
ClarityNLP evaluates expressions via a multi-step procedure. In this section we describe the different processing stages.
Expression Tokenization and Parsing¶
The NLPQL front end parses the NLPQL file and sends the raw expression text
to the evaluator (nlp/data_access/expr_eval.py). The evaluator module
parses the expression text and converts it to a fully-parenthesized token
string. The tokens are separated by whitespace and all operators are replaced
by string mnemonics (such as GE for the operator >=, LT for the
operator <, etc.).
If the expression includes any subexpressions involving numeric literals, they are evaluated at this stage and the literal subexpression replaced with the result.
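The sketch below shows one way a whitespace-separated token string with operator mnemonics could be produced. Only GE and LT are named in the text; every other mnemonic, the regular expression, and the function name to_token_string are assumptions for illustration. The sketch does not add the full parenthesization or fold literal subexpressions, both of which the real evaluator is described as doing.

```python
import re

# GE and LT appear in the text above; the remaining names are assumed.
MNEMONICS = {
    ">=": "GE", "<=": "LE", "==": "EQ", "!=": "NE",
    ">": "GT", "<": "LT",
    "+": "PLUS", "-": "MINUS", "*": "MULT", "/": "DIV",
    "%": "MOD", "^": "EXP",
}

# Order matters: decimals before integers, two-character operators
# before one-character operators.
TOKEN_RE = re.compile(
    r"\d+\.\d+|\d+|[A-Za-z_][\w.]*|>=|<=|==|!=|[()<>+\-*/%^]"
)

def to_token_string(expression):
    """Split an expression into tokens and replace operators by mnemonics."""
    tokens = TOKEN_RE.findall(expression)
    return " ".join(MNEMONICS.get(tok, tok) for tok in tokens)

token_string = to_token_string("(Meas.dimension_X>5) AND (Meas.dimension_X<20)")
print(token_string)
```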
Validity Checks¶
The evaluator then runs validity checks on each token. If it finds a token that
it does not recognize, it tries to resolve it into a series of known NLPQL
features separated by logic operators. For instance, if the evaluator were
to encounter the token hasRigorsANDhasDyspnea under circumstances in which
only hasRigors and hasDyspnea were valid NLPQL features, it would
replace this single token with the string hasRigors AND hasDyspnea. If it
cannot perform the separation (such as with the token
hasRigorsA3NDhasDyspnea) it reports an error and writes error information
into the log file.
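A fused token can be resolved with a small recursive matcher that alternates between expecting a feature name and expecting a logic operator. This is a sketch of the idea only; the function name split_fused_token is hypothetical, and returning None stands in for the error reporting and logging that the evaluator performs.

```python
LOGIC_OPS = ("AND", "OR", "NOT")

def split_fused_token(token, features):
    """Return the features and operators that spell out token, or None."""
    # Try longer feature names first so that a name which is a prefix of
    # another name does not shadow it.
    candidates = sorted(features, key=len, reverse=True)

    def parse(rest, expect_feature):
        if not rest:
            # A valid sequence ends on a feature, so at the end of the
            # input we must be expecting an operator, not a feature.
            return None if expect_feature else []
        options = candidates if expect_feature else LOGIC_OPS
        for option in options:
            if rest.startswith(option):
                tail = parse(rest[len(option):], not expect_feature)
                if tail is not None:
                    return [option] + tail
        return None

    return parse(token, True)

known = {"hasRigors", "hasDyspnea"}
print(split_fused_token("hasRigorsANDhasDyspnea", known))
print(split_fused_token("hasRigorsA3NDhasDyspnea", known))
```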
If the validity checks pass, the evaluator next determines the expression type.
The valid types are EXPR_TYPE_MATH, EXPR_TYPE_LOGIC, and
EXPR_TYPE_MIXED. If the expression type cannot be determined, the evaluator
reports an error and writes error information into the log file.
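The type rules given earlier (variables from a single feature are math, bare features are logic, anything else is mixed) can be written down as a short classifier. This is a simplified sketch: the string values of the three constants, the regular expression, and the function name classify are assumptions, and raising ValueError stands in for the evaluator's error logging.

```python
import re

# The constant names come from the text; their values here are assumed.
EXPR_TYPE_MATH = "math"
EXPR_TYPE_LOGIC = "logic"
EXPR_TYPE_MIXED = "mixed"

LOGIC_OPS = {"AND", "OR", "NOT"}
NAME_RE = re.compile(r"\b[A-Za-z_]\w*(?:\.\w+)?")

def classify(expression):
    """Classify an expression by the kinds of names it contains."""
    names = NAME_RE.findall(expression)
    variables = [n for n in names if "." in n]
    features = [n for n in names
                if "." not in n and n.upper() not in LOGIC_OPS]
    if variables and features:
        return EXPR_TYPE_MIXED          # math and logic together
    if variables:
        prefixes = {v.split(".")[0] for v in variables}
        # variables from two or more NLPQL features make the type mixed
        return EXPR_TYPE_MATH if len(prefixes) == 1 else EXPR_TYPE_MIXED
    if features:
        return EXPR_TYPE_LOGIC
    raise ValueError("cannot determine expression type")

print(classify("Temperature.value >= 100.4"))
print(classify("hasFever AND (hasDyspnea OR hasTachycardia)"))
```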
Subexpression Substitution¶
If the expression is of mixed type, the evaluator locates all simple math
subexpressions contained within and replaces them with temporary NLPQL feature
names, thereby converting math subexpressions to logic subexpressions. The
substitution process continues until all mathematical
subexpressions have been replaced with substitute NLPQL features, at which
point the expression type becomes EXPR_TYPE_LOGIC.
To illustrate the substitution process, consider one of the examples from above:
Temperature.value >= 100.4 AND (hasRigors OR hasNausea) AND (LesionMeasurement.dimension_X >= 15)
This expression is of mixed type, since it contains the mathematical
subexpression Temperature.value >= 100.4, the logic subexpression
(hasRigors OR hasNausea), and the mathematical subexpression
(LesionMeasurement.dimension_X >= 15). The NLPQL features in each math
subexpression, Temperature and LesionMeasurement, also differ.
The evaluator identifies the Temperature subexpression and replaces it with a
substitute NLPQL feature, m0 (for instance). This transforms the original
expression into:
(m0) AND (hasRigors OR hasNausea) AND (LesionMeasurement.dimension_X >= 15)
Now only one mathematical subexpression remains.
The evaluator again makes a substitution, m1, for the remaining mathematical
subexpression, which converts the original into:
(m0) AND (hasRigors OR hasNausea) AND (m1)
This is now a pure logic expression.
Thus the substitution process transforms the original mixed-type expression into three subexpressions, each of which is of simple math or simple logic type:
subexpression 1 (m0): 'Temperature.value >= 100.4'
subexpression 2 (m1): 'LesionMeasurement.dimension_X >= 15'
subexpression 3: '(m0) AND (hasRigors OR hasNausea) AND (m1)'
By evaluating each subexpression in order, the result of evaluating the original mixed-type expression can be obtained.
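The substitution can be sketched with a regular expression, under the simplifying assumption that every math subexpression has the shape Feature.field, a comparison operator, and a numeric literal. The real evaluator handles more general math; the names CORE_RE and substitute_math are hypothetical.

```python
import re

# Assumed shape of a simple math subexpression:
#   Feature.field <comparison> <number>
CORE_RE = re.compile(
    r"[A-Za-z_]\w*\.\w+\s*(?:>=|<=|==|!=|>|<)\s*\d+(?:\.\d+)?"
)

def substitute_math(expression):
    """Replace math subexpressions with temporary features m0, m1, ..."""
    subexpressions = {}

    def replace(match):
        name = "m{0}".format(len(subexpressions))
        subexpressions[name] = match.group(0)
        return "({0})".format(name)

    logic = CORE_RE.sub(replace, expression)
    # If the original subexpression was already parenthesized, the
    # substitution leaves doubled parentheses; collapse them.
    logic = re.sub(r"\(\((m\d+)\)\)", r"(\1)", logic)
    return logic, subexpressions

mixed = ("Temperature.value >= 100.4 AND (hasRigors OR hasNausea) "
         "AND (LesionMeasurement.dimension_X >= 15)")
logic_expr, math_exprs = substitute_math(mixed)
print(logic_expr)
print(math_exprs)
```

The math subexpressions would be evaluated first, so that result documents labeled m0 and m1 exist by the time the logic expression runs.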
Evaluation of Mathematical Expressions¶
Removal of Unnecessary Parentheses¶
The evaluator next removes all unnecessary pairs of parentheses from the mathematical expression. A pair of parentheses is unnecessary if it can be removed without affecting the result. The evaluator detects changes in the result by converting the expression with a pair of parentheses removed to postfix, then comparing the postfix form with that of the original. If the postfix expressions match, that pair of parentheses was non-essential and can be discarded. The postfix form of the expression has no parentheses, as described below.
Conversion to Explicit Form¶
After removal of nonessential parentheses, the evaluator rewrites the
expression so that the tokens match what’s actually stored in the database.
This involves an explicit comparison for the NLPQL feature and the
unadorned use of the field name for variables. To illustrate, consider the
hasFever example above:
define hasFever:
where Temperature.value >= 100.4;
The expression portion of this define statement is
Temperature.value >= 100.4. The evaluator rewrites this as:
(nlpql_feature == Temperature) AND (value >= 100.4)
In this form the tokens match the fields actually stored in the task result documents in MongoDB.
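A minimal sketch of this rewrite, assuming the input is a simple math expression whose variables all share one NLPQL feature. The function name to_explicit_form is hypothetical and the evaluator's own implementation may differ.

```python
import re

VAR_RE = re.compile(r"\b([A-Za-z_]\w*)\.(\w+)")

def to_explicit_form(expression):
    """Rewrite 'Feature.field ...' as an explicit feature test plus fields."""
    features = {m.group(1) for m in VAR_RE.finditer(expression)}
    if len(features) != 1:
        raise ValueError("expected variables from exactly one NLPQL feature")
    feature = features.pop()
    # Strip the 'Feature.' prefix so only the bare field names remain.
    bare = VAR_RE.sub(lambda m: m.group(2), expression)
    return "(nlpql_feature == {0}) AND ({1})".format(feature, bare)

explicit = to_explicit_form("Temperature.value >= 100.4")
print(explicit)
```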
Conversion to Postfix¶
Direct evaluation of an infix expression is complicated by parenthesization and operator precedence issues. The evaluation process can be greatly simplified by first converting the infix expression to postfix form. Postfix expressions require no parentheses, and a simple stack-based evaluator can be used to evaluate them directly.
Accordingly, a conversion to postfix form takes place next. This conversion
process requires an operator precedence table. The NLPQL operator precedence
levels match those of Python and are listed here for reference. Lower numbers
imply lower precedence, so or has a lower precedence than and, which
has a lower precedence than +, etc.
Operator | Precedence Value |
---|---|
( | 0 |
) | 0 |
or | 1 |
and | 2 |
not | 3 |
< | 4 |
<= | 4 |
> | 4 |
>= | 4 |
!= | 4 |
== | 4 |
+ | 9 |
- | 9 |
* | 10 |
/ | 10 |
% | 10 |
^ | 12 |
Conversion from infix to postfix is unambiguous if operator precedence and
associativity are known. Operator precedence is given by the table above.
All NLPQL operators are left-associative except for exponentiation, which is
right-associative. The infix-to-postfix conversion algorithm is the standard
one and can be found in the function _infix_to_postfix in the file
nlp/data_access/expr_eval.py.
After conversion to postfix, the hasFever expression becomes:
'nlpql_feature', 'Temperature', '==', 'value', '100.4', '>=', 'and'
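For readers unfamiliar with the standard algorithm, here is a compact shunting-yard sketch that uses the precedence table above. It is an independent illustration, not a copy of _infix_to_postfix, and it assumes the input has already been split into tokens.

```python
PRECEDENCE = {
    "(": 0, ")": 0, "or": 1, "and": 2, "not": 3,
    "<": 4, "<=": 4, ">": 4, ">=": 4, "!=": 4, "==": 4,
    "+": 9, "-": 9, "*": 10, "/": 10, "%": 10, "^": 12,
}
RIGHT_ASSOCIATIVE = {"^"}

def infix_to_postfix(tokens):
    """Convert a list of infix tokens to postfix (shunting-yard)."""
    output, stack = [], []
    for tok in tokens:
        if tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()  # discard the matching "("
        elif tok in PRECEDENCE:
            # Pop operators of higher precedence, or of equal precedence
            # when the incoming operator is left-associative.
            while stack and stack[-1] != "(":
                top = stack[-1]
                if (PRECEDENCE[top] > PRECEDENCE[tok] or
                        (PRECEDENCE[top] == PRECEDENCE[tok] and
                         tok not in RIGHT_ASSOCIATIVE)):
                    output.append(stack.pop())
                else:
                    break
            stack.append(tok)
        else:
            output.append(tok)  # operand
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced parentheses")
        output.append(op)
    return output

infix = "( nlpql_feature == Temperature ) and ( value >= 100.4 )".split()
postfix = infix_to_postfix(infix)
print(postfix)
```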
Generation of the Aggregation Pipeline¶
The next task for the evaluator is to convert the expression into a sequence of MongoDB aggregation pipeline stages. This process involves the generation of an initial $match query to filter out everything but the data for the current job. The match query also checks for the existence of all entries in the field list and that they have non-null values. A simple existence check is not sufficient, since a null field actually exists but has a value that cannot be used for computation. Hence checks for existence and a non-null value are both necessary.
For the hasFever example, the initial match query generates a pipeline
filter stage that looks like this, assuming a job_id of 12345:
{
"$match": {
"job_id": 12345,
"nlpql_feature": {"$exists":True, "$ne":None},
"value" : {"$exists":True, "$ne":None}
}
}
This match pipeline stage runs first and performs coarse filtering on the data in the result database. It finds only those task result documents matching the specified job_id, and it further restricts consideration to those documents having valid entries for the expression’s fields.
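Because the filter has the same shape for every field, it can be built programmatically. The helper name build_initial_match is an assumption for this sketch; it only constructs the dictionary and does not contact MongoDB.

```python
def build_initial_match(job_id, field_names):
    """Build a $match stage that filters on job_id and requires each
    field to exist and to be non-null."""
    match = {"job_id": job_id}
    for name in field_names:
        match[name] = {"$exists": True, "$ne": None}
    return {"$match": match}

stage = build_initial_match(12345, ["nlpql_feature", "value"])
print(stage)
```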
Subsequent Pipeline Stages¶
After generation of the initial match filter stage, the postfix expression is then ‘evaluated’ by a stack-based mechanism. The result of the evaluation process is not the actual expression value, but instead a set of MongoDB aggregation commands that tell MongoDB how to compute the result. The evaluation process essentially generates Python dictionaries that obey the aggregation syntax rules. More information about the aggregation pipeline can be found in the MongoDB documentation.
The pipeline actually does a $project operation and creates a new document
with a Boolean field called value.
This field has a value of True or False according to whether the source
document satisfied the mathematical expression. The _id field of the
projected document matches that of the original, so that a simple query on
these _id fields can be used to recover the desired documents.
The final aggregation pipeline for our example becomes:
// (nlpql_feature == Temperature) and (value >= 100.4)
{
"$match": {
"job_id":12345,
"nlpql_feature": {"$exists":True, "$ne":None},
"value" : {"$exists":True, "$ne":None}
}
},
{
"$project" : {
"value" : {
"$and" : [
{"$eq" : ["$nlpql_feature", "Temperature"]},
{"$gte" : ["$value", 100.4]}
]
}
}
}
The completed aggregation pipeline gets sent to MongoDB for evaluation.
Mongo performs the initial filtering operation, applies the subsequent
pipeline stages to all surviving documents, and sets the “value” Boolean
result. A final query extracts the matching documents and writes new result
documents with an nlpql_feature field equal to the label from the
define statement, which for this example would be hasFever.
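The stack-based generation of the $project stage can be sketched as follows. The operator mapping uses MongoDB aggregation operators that exist under these names, but the function names and the rule for telling fields from literals are assumptions for illustration and not the evaluator's actual code.

```python
OPERATOR_MAP = {
    "and": "$and", "or": "$or",
    "==": "$eq", "!=": "$ne",
    "<": "$lt", "<=": "$lte", ">": "$gt", ">=": "$gte",
    "+": "$add", "-": "$subtract", "*": "$multiply",
    "/": "$divide", "%": "$mod", "^": "$pow",
}

def to_operand(token, field_names):
    """Map a postfix operand to a field reference, a number, or a string."""
    if token in field_names:
        return "$" + token            # reference to a document field
    try:
        return float(token) if "." in token else int(token)
    except ValueError:
        return token                  # string literal such as a feature name

def postfix_to_project(postfix, field_names):
    """'Evaluate' a postfix expression into a $project stage."""
    stack = []
    for tok in postfix:
        if tok in OPERATOR_MAP:
            right = stack.pop()
            left = stack.pop()
            stack.append({OPERATOR_MAP[tok]: [left, right]})
        else:
            stack.append(to_operand(tok, field_names))
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return {"$project": {"value": stack[0]}}

postfix_tokens = ["nlpql_feature", "Temperature", "==",
                  "value", "100.4", ">=", "and"]
project_stage = postfix_to_project(postfix_tokens, {"nlpql_feature", "value"})
print(project_stage)
```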
Evaluation of Logic Expressions¶
The initial stages of the evaluation process for logic expressions proceed similarly to those for mathematical expressions. Unnecessary parentheses are removed and the expression is converted to postfix.
Detection of n-ary AND and OR¶
After the postfix conversion, a pattern matcher looks for instances of n-ary
AND and/or OR in the set of postfix tokens. An n-ary OR would look
like this, for n == 4:
// infix
hasRigors OR hasDyspnea OR hasTachycardia OR hasNausea
// postfix
hasRigors hasDyspnea OR hasTachycardia OR hasNausea OR
The n-value refers to the number of operands. All such n-ary instances are
replaced with a variant form of the operator that includes the count. The
reason for this is that n-ary AND and OR can be handled easily by the
aggregation pipeline, and their use simplifies the pipeline construction
process. For this example, the rewritten postfix form would become:
hasRigors hasDyspnea hasTachycardia hasNausea OR4
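One way to perform this rewrite is to walk the postfix tokens with a stack and extend a chain whenever the left operand was produced by the same operator. Note that this is a stack-based substitute for the pattern matcher mentioned above, chosen because it is short; the function name collapse_nary is hypothetical. It collapses only left-nested chains, which is what the postfix form of a left-associative operator produces.

```python
def collapse_nary(postfix, ops=("AND", "OR")):
    """Rewrite chains of one binary operator as a single counted operator."""
    # Each stack entry is (tokens, root_operator, operand_count).
    stack = []
    for tok in postfix:
        if tok in ops:
            right_tokens, _, _ = stack.pop()
            left_tokens, left_op, left_count = stack.pop()
            if left_op == tok:
                # The left operand is already a chain of this operator:
                # drop its trailing operator token and extend the chain.
                merged = left_tokens[:-1] + right_tokens
                count = left_count + 1
            else:
                merged = left_tokens + right_tokens
                count = 2
            label = tok if count == 2 else "{0}{1}".format(tok, count)
            stack.append((merged + [label], tok, count))
        else:
            stack.append(([tok], None, 1))
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0][0]

tokens = "hasRigors hasDyspnea OR hasTachycardia OR hasNausea OR".split()
collapsed = collapse_nary(tokens)
print(" ".join(collapsed))
```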
Generation of the Aggregation Pipeline¶
As with mathematical expressions, the logic expression aggregation pipeline
begins with an initial stage that filters on the job_id and checks that the
nlpql_feature field exists and is non-null. No explicit field checks are
needed since logic expressions do not use NLPQL variables. For a job_id of
12345, this initial filter stage is:
{
"$match": {
"job_id":12345,
"nlpql_feature": {"$exists":True, "$ne":None}
}
}
Following this is another filter stage that removes all docs not having the desired NLPQL features. For the original logic expression example above:
hasFever AND (hasDyspnea OR hasTachycardia)
this second filter stage would look like this:
{
"$match": {
"nlpql_feature": {"$in": ['hasFever', 'hasDyspnea', 'hasTachycardia']}
}
}
Grouping by Value of the Context Variable¶
The next stage in the logic pipeline is to group documents by the value of the context field. Recall that NLPQL files specify a context of either ‘document’ or ‘patient’, meaning that a document-centric or patient-centric view of the results is desired. In a document context, ClarityNLP needs to examine all data pertaining to a given document. In a patient context, it needs to examine all data pertaining to a given patient.
The grouping operation collects all such data (the ClarityNLP task result
documents) that pertain to a given document or a given patient. Documents are
distinguished by their report_id field, and patients are distinguished by
their patient IDs, which are stored in the subject field. You can
think of these groups as being the ‘evidence’ for a given document or for
a given patient. If the patient has the conditions expressed in the NLPQL
file, the evidence for it will reside in the group for that patient.
As part of the grouping operation ClarityNLP also generates a set of NLPQL features for each group. This set is called the feature_set and it will be used to evaluate the expression logic for the group as a whole.
The grouping pipeline stage looks like this:
{
"$group": {
"_id": "${0}".format(context_field),
# save only these four fields from each doc; more efficient
# than saving entire doc, uses less memory
"ntuple": {
"$push": {
"_id": "$_id",
"nlpql_feature": "$nlpql_feature",
"subject": "$subject",
"report_id": "$report_id"
}
},
"feature_set": {"$addToSet": "$nlpql_feature"}
}
}
Here we see the $group operator grouping the documents on the value of the context field. An ntuple array is generated for each different value of the context variable. This is the ‘evidence’ as discussed above. Only the essential fields for each document are used, which reduces memory consumption and improves efficiency. We also see the generation of the feature set for each group, in which each NLPQL feature for the group’s documents is added to the set.
At the conclusion of this pipeline stage, each group has two fields: an
ntuple array that contains the relevant data for each document in the
group, and a feature_set field that contains the distinct features for
the group.
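To make the effect of the $group stage concrete, the following sketch reproduces the same grouping in plain Python on a few made-up result documents. It mimics what MongoDB computes with $push and $addToSet; it is not code that ClarityNLP runs, and the sample documents are invented.

```python
from collections import defaultdict

KEPT_FIELDS = ("_id", "nlpql_feature", "subject", "report_id")

def group_by_context(docs, context_field):
    """Group result documents by the context field, keeping an ntuple
    list and a set of distinct NLPQL features for each group."""
    groups = defaultdict(lambda: {"ntuple": [], "feature_set": set()})
    for doc in docs:
        group = groups[doc[context_field]]
        # analogous to $push of the four essential fields
        group["ntuple"].append({k: doc[k] for k in KEPT_FIELDS})
        # analogous to $addToSet on nlpql_feature
        group["feature_set"].add(doc["nlpql_feature"])
    return dict(groups)

# Invented sample documents; 'value' is dropped by the grouping.
docs = [
    {"_id": 1, "nlpql_feature": "hasFever", "subject": "19054",
     "report_id": "1264178", "value": 101.2},
    {"_id": 2, "nlpql_feature": "hasDyspnea", "subject": "19054",
     "report_id": "798209"},
    {"_id": 3, "nlpql_feature": "hasDyspnea", "subject": "20001",
     "report_id": "555"},
]

groups = group_by_context(docs, "subject")  # patient context
for subject, group in groups.items():
    print(subject, sorted(group["feature_set"]), len(group["ntuple"]))
```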
Logic Operation Stage¶
After the grouping operation, the logic operations of the expression are
applied to the elements of the feature set. If a particular patient
satisfies the hasFever condition, then at least one document in that
patient’s group will have an NLPQL feature field with the value of
hasFever. Since all the distinct values of the NLPQL features for the
group are stored in the feature set, the feature set must also have an element
equal to hasFever.
A check for set membership using aggregation syntax is expressed as:
{"$in": ["hasFever", "$feature_set"]}
This construct means to use the $in operator to test whether feature_set
contains the element hasFever. The $in operator returns a Boolean result.
A successful test for feature set membership means that the patient has the stated feature.
The evaluator implements the expression logic by translating it into a series of set membership tests. For our example above, the logic operation pipeline stage becomes:
{
'$match': {
'$expr': {
'$and': [
{'$in': ['hasFever', '$feature_set']},
{
'$or': [
{'$in': ['hasDyspnea', '$feature_set']},
{'$in': ['hasTachycardia', '$feature_set']}
]
}
]
}
}
}
Once again we have a match operation to filter the documents. Only those documents satisfying the expression logic will survive the filter. The $expr operator allows the use of aggregation syntax in contexts where the standard MongoDB query syntax would be required.
Following that we see a series of logic operations for our expression
hasFever AND (hasDyspnea OR hasTachycardia). The inner $or operation
tests the feature set for membership of hasDyspnea and hasTachycardia.
If either or both are present, the $or operator returns True. The result of
the $or is then used in an $and operation which tests the feature set
for the presence of hasFever. If it is also present, the $and operator
returns True as well, and the document in question survives the filter operation.
To summarize the evaluation process so far: ClarityNLP converts infix logic expressions to postfix form and groups the documents by value of the context variable. It uses a stack-based postfix evaluation mechanism to generate the aggregation statements for the expression logic. Each logic operation is converted to a test for the presence of an NLPQL feature in the feature set.
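The translation from a postfix logic expression to set membership tests can be sketched with the same stack technique. This simplified version handles only binary AND and OR; how the evaluator translates NOT and the counted n-ary operators is not shown in this section, so the sketch rejects NOT instead of guessing. The function name is hypothetical.

```python
def logic_postfix_to_match(postfix):
    """Turn a postfix logic expression into a $match/$expr stage whose
    leaves are membership tests against the group's feature_set."""
    stack = []
    for tok in postfix:
        upper = tok.upper()
        if upper in ("AND", "OR"):
            right = stack.pop()
            left = stack.pop()
            stack.append({"$" + upper.lower(): [left, right]})
        elif upper == "NOT":
            raise NotImplementedError("NOT is outside the scope of this sketch")
        else:
            # each feature becomes a set membership test
            stack.append({"$in": [tok, "$feature_set"]})
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return {"$match": {"$expr": stack[0]}}

logic_stage = logic_postfix_to_match(
    "hasFever hasDyspnea hasTachycardia OR AND".split()
)
print(logic_stage)
```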
Final Aggregation Pipeline¶
With these operations the pipeline is complete. The full pipeline for our example is:
// aggregation pipeline for hasFever AND (hasDyspnea OR hasTachycardia)
// filter documents on job_id and check validity of the nlpql_feature field
{
"$match": {
"job_id":12345,
"nlpql_feature": {"$exists":True, "$ne":None}
}
},
// filter docs on the desired NLPQL feature values
{
"$match": {
"nlpql_feature": {"$in": ['hasFever', 'hasDyspnea', 'hasTachycardia']}
}
},
// group docs by value of context variable and create feature set
{
"$group": {
"_id": "${0}".format(context_field),
"ntuple": {
"$push": {
"_id": "$_id",
"nlpql_feature": "$nlpql_feature",
"subject": "$subject",
"report_id": "$report_id"
}
},
"feature_set": {"$addToSet": "$nlpql_feature"}
}
},
// perform expression logic on the feature set
{
'$match': {
'$expr': {
'$and': [
{'$in': ['hasFever', '$feature_set']},
{
'$or': [
{'$in': ['hasDyspnea', '$feature_set']},
{'$in': ['hasTachycardia', '$feature_set']}
]
}
]
}
}
}
Result Generation¶
After constructing a math or logic aggregation pipeline, the evaluator runs the
pipeline and receives the results from MongoDB. The result set is either a list
of document ObjectId values (_id) for a math expression or an ObjectId list
with group info for logic expressions. For math expressions, the documents
whose _id values appear in the list are queried and written out as the
result set. These documents have their nlpql_feature field set to that
of the define statement that contained the expression.
For logic expressions the process is more complex. To help explain what the
evaluator does we present here a representation of the grouped documents after
running the pipeline above, for the expression
hasFever AND (hasDyspnea OR hasTachycardia):
ObjectId (_id) | nlpql_feature | subject | report_id |
---|---|---|---|
5c2e9e3431ab5b05db3430e1 | hasDyspnea | 19054 | 798209 |
5c2e9e3431ab5b05db3430e2 | hasDyspnea | 19054 | 798209 |
5c2e9e3431ab5b05db3430e3 | hasDyspnea | 19054 | 798209 |
5c2e9e3431ab5b05db3430e4 | hasDyspnea | 19054 | 798209 |
5c2e9ec931ab5b05db343efa | hasDyspnea | 19054 | 1303796 |
5c2ea2bd31ab5b05db34868c | hasTachycardia | 19054 | 1699977 |
5c2ea2bd31ab5b05db34868d | hasTachycardia | 19054 | 1699977 |
5c2ea35a31ab5b05db348f19 | hasTachycardia | 19054 | 1802359 |
5c2ea3a531ab5b05db3492f6 | hasTachycardia | 19054 | 1905337 |
5c2ea42431ab5b05db34998c | hasTachycardia | 19054 | 1802375 |
5c2ea42431ab5b05db34998d | hasTachycardia | 19054 | 1802375 |
5c2eb55831ab5b05db35097b | hasFever | 19054 | [‘1264178’] |
5c2eb55831ab5b05db350d45 | hasFever | 19054 | [‘1699944’] |
5c2eb55831ab5b05db350d46 | hasFever | 19054 | [‘1699944’] |
Here we see a representation of the document group for patient 19054. This
group of documents can be considered to be the “evidence” for this patient.
In the ObjectId column are the MongoDB ObjectId values for each task result
document or mathematical result document. The nlpql_feature column
shows which NLPQL feature ClarityNLP found for that document. The subject
column shows that all documents in the group belong to patient 19054, and the
report_id column shows the document identifier.
We see that patient 19054 has five instances of hasDyspnea, six instances
of hasTachycardia, and three instances of hasFever. You can consider
this group as being composed of three subgroups with five, six, and three
elements, respectively.
ClarityNLP presents result documents in a “flattened” format. For each NLPQL
label introduced in a “define” statement, ClarityNLP generates a set of result
documents containing that label in the nlpql_feature field. Each result
document also contains a record of the source documents that were used as
evidence for that label.
Flattening of the Result Group¶
To flatten these results and generate a set of output documents labeled by the
hasSymptoms NLPQL feature (from the original “define” statement),
ClarityNLP essentially has two options:
- generate all possible ways to derive hasSymptoms from this data
- generate the minimum number of ways to derive hasSymptoms from this data (while not ignoring any data)
The maximal result set can be generated by the following reasoning. First,
in how many ways can patient 19054 satisfy the condition
hasDyspnea OR hasTachycardia? From the data in the table, there are five
ways to satisfy the hasDyspnea condition and six ways to satisfy the
hasTachycardia condition, for a total of 5 + 6 = 11 ways. Then, for
each of these ways, there are three ways for the patient to satisfy the
condition hasFever. Thus there are a total of 3 * (5 + 6) = 3 * 11 = 33
ways for this patient to satisfy the condition
hasFever AND (hasDyspnea OR hasTachycardia), which would result in the
generation of 33 output documents under a maximal representation.
The minimal result set can be generated by the following reasoning.
We have seen that there are 11 ways for this patient to satisfy the condition
hasDyspnea OR hasTachycardia. Each of these must be paired with a
hasFever, from the logical AND operator in the expression. By repeating
each of the hasFever entries, we can “tile” the output and pair a
hasFever with one of the 11 others. This procedure generates a result set
containing only 11 entries instead of 33. It uses all of the output data, and
it minimizes data redundancy.
In general, the cardinalities of the sets of NLPQL features connected by
logical OR are added together to compute the number of possible results.
For features connected by logical AND, the cardinalities are multiplied
to get the total number of possibilities under a maximal representation (this
is the Cartesian product). Under a minimal representation, the cardinality of
the result is equal to the maximum cardinality of the constituent subsets.
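These counting rules can be checked with a few lines of Python. The tuple-based expression tree and the function name count_results are assumptions made for the example; the leaf counts are the five, six, and three documents from the table for patient 19054.

```python
from functools import reduce
from operator import mul

def count_results(node, minimal):
    """Count flattened results for an expression tree.

    A node is either an int (the number of result documents for one
    feature) or a tuple of the form (operator, child, child, ...).
    """
    if isinstance(node, int):
        return node
    op, *children = node
    counts = [count_results(child, minimal) for child in children]
    if op == "OR":
        return sum(counts)              # cardinalities add
    if op == "AND":
        # Cartesian product for maximal, largest subset for minimal
        return max(counts) if minimal else reduce(mul, counts, 1)
    raise ValueError("unsupported operator: {0}".format(op))

# hasFever AND (hasDyspnea OR hasTachycardia) for patient 19054
expr = ("AND", 3, ("OR", 5, 6))
max_count = count_results(expr, minimal=False)
min_count = count_results(expr, minimal=True)
print(max_count, min_count)  # 33 11
```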
So which output representation does ClarityNLP use?
ClarityNLP uses the minimal representation of the output data.
Here is what the result set looks like using a minimal representation. Each
of the 11 elements contains a pair of documents, one with the feature
hasFever and the other having either hasDyspnea or hasTachycardia,
as required by the expression. We show only the last four hex digits of the
ObjectId for clarity:
// expression: hasFever AND (hasDyspnea OR hasTachycardia)
('097b', 'hasFever'), ('30e1', 'hasDyspnea')
('0d45', 'hasFever'), ('30e2', 'hasDyspnea')
('0d46', 'hasFever'), ('30e3', 'hasDyspnea')
('097b', 'hasFever'), ('30e4', 'hasDyspnea')
('0d45', 'hasFever'), ('3efa', 'hasDyspnea')
('0d46', 'hasFever'), ('868c', 'hasTachycardia')
('097b', 'hasFever'), ('868d', 'hasTachycardia')
('0d45', 'hasFever'), ('8f19', 'hasTachycardia')
('0d46', 'hasFever'), ('92f6', 'hasTachycardia')
('097b', 'hasFever'), ('998c', 'hasTachycardia')
('0d45', 'hasFever'), ('998d', 'hasTachycardia')
Note that the three hasFever entries repeat three times, followed by
another repeat of the first two entries to make a total of 11. Each of these
is paired with one of the five hasDyspnea entries or one of the
six hasTachycardia entries. No data for this patient has been lost,
and the result is 11 documents in a flattened format satisfying the
logic of the original expression.
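The tiling can be reproduced by cycling through the shorter list while stepping through the longer one. This is a sketch of the pairing idea using the abbreviated ObjectId values from the listing above; the function name tile is hypothetical and this is not the evaluator's actual flattening code.

```python
from itertools import cycle

def tile(left, right):
    """Pair two lists, repeating the shorter one so that every element
    of the longer list appears exactly once."""
    if len(left) >= len(right):
        return list(zip(left, cycle(right)))
    return list(zip(cycle(left), right))

fever = [(oid, "hasFever") for oid in ("097b", "0d45", "0d46")]
dyspnea = [(oid, "hasDyspnea")
           for oid in ("30e1", "30e2", "30e3", "30e4", "3efa")]
tachycardia = [(oid, "hasTachycardia")
               for oid in ("868c", "868d", "8f19", "92f6", "998c", "998d")]

# OR concatenates the two sets (5 + 6 = 11); AND tiles hasFever across them.
pairs = tile(fever, dyspnea + tachycardia)
for fever_doc, other_doc in pairs:
    print(fever_doc, other_doc)
```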
Testing the Expression Evaluator¶
There is a comprehensive test program for the expression evaluator in the file
nlp/data_access/expr_tester.py. The test program requires a running
instance of MongoDB. We strongly recommend running Mongo on the same machine
as the test program to minimize data transfer delays.
The test program loads a data file into MongoDB and evaluates a suite of expressions using the data. The expression logic is separately evaluated with Python set operations. The results from the two evaluations are compared and the tests pass only if both evaluations produce identical sets of patients.
The test program can be run from the command line. For usage info, run with
the --help option:
python3 ./expr_tester.py --help
The test program assumes that the user has permission to create a database without authentication.
To run the test suite with the default options, first launch MongoDB on your local system. Information about how to do that can be found in our native setup guide.
After MongoDB initializes, run the test program with this command, assuming the default Mongo port of 27017:
python3 ./expr_tester.py
If your MongoDB instance is hosted elsewhere or uses a non-default port number, provide the connection parameters explicitly:
python3 ./expr_tester.py --mongohost <ip_address> --mongoport <port_number>
The test program takes several minutes to run. Upon completion it should report that all tests passed.