Custom Task Algorithms¶
Building custom task algorithms in ClarityNLP is a way to create custom algorithms and include external libraries that are callable from NLPQL. To begin creating custom task algorithms, you need a few things to get started.
Create a Python Class¶
In nlp/custom_tasks
, create a new Python class that extends BaseTask
. BaseTask
is a class that sets up the data, context and connections needed to read and write data in ClarityNLP.
See the source code for BaseTask
at nlp/tasks/task_utilities.py
. You can start with the sample below and copy and paste the basic structure that you need for a custom task.
from tasks.task_utilities import BaseTask
from pymongo import MongoClient
class SampleTask(BaseTask):
task_name = "MyCustomTask"
def run_custom_task(self, temp_file, mongo_client: MongoClient):
for doc in self.docs:
# you can get sentences and text through these utility methods
text = self.get_document_text(doc)
sentences = self.get_document_sentences(doc)
# put your custom algorithm here, save your output to a dictionary, and write results below
obj = {
'foo': 'bar',
'sentence_one': sentences[0]
}
# writing results
self.write_result_data(temp_file, mongo_client, doc, obj)
# writing to log (optional)
self.write_log_data("DONE", "done writing sample data")
Task Name¶
task_name
is important to include in your SampleTask if you want a user-friendly name to be called from NLPQL.
In this example, the task name is MyCustomTask
, but if a custom task wasn’t specified, the task name would be SampleTask
.
Also, it’s important to be aware with naming that you can overwrite other custom tasks and even core ClarityNLP tasks (which may be the desired outcome). So in most cases, you’ll want to provide a unique name.
Running a Custom Task¶
The ClarityNLP engine will automatically create a distributed job and assign a set of documents to each worker task. Knowing that, there are just a few things to do to create custom tasks. You’ll need to implement the run_custom_task
function in your task.
That will give you access to the self
parameter which has attributes from the job and the set of documents your algorithm will run on. You don’t need to worry about them too much, but know they are accessible in your custom task.
You also have access to a temp_file
which is provided by Luigi. It’s not necessarily used by ClarityNLP, but you may wish to use it for some logging purpose (logging will be discussed more below). In addition, you have a mongo_client
connection that is opened and closed for you, however you’ll need access to this object when you’re writing output for your ClarityNLP NLPQL.
Iterating over Documents¶
Since you are responsible for a set of documents, you need to get the list of documents which has been assigned to this worker. This is callable by using self.docs
and should be iterable in Python.
Per document (or doc
), there are few helper functions available for you.
self.get_document_text(doc)
- gets the text of document as a stringself.get_document_sentences(doc)
- gets a list of the sentences in a document, parsed with the default ClarityNLP sentence segmentor
Accessing Custom Variables¶
If you have custom parameters in your NLPQL, you can access them via the custom_arguments
dictionary in your pipeline config.
my_value = self.pipeline_config.custom_arguments['my_value']
Saving Results¶
All data that ClarityNLP uses in NLPQL needs to eventually end up in MongoDB. BaseTask
provides two types of hooks, depending on whether you have a single object or a list of objects. Both return the new unique id (or ids) from MongoDB.
self.write_result_data(temp_file, mongo_client, doc, obj)
- saves results whereobj
is a Pythondict
self.write_multiple_result_data(temp_file, mongo_client, doc, obj_list)
- saves results whereobj_list
is a Pythonlist
orset
(implies multiple results per document)
Logging and Debugging¶
ClarityNLP provides two means for logging and debugging your custom tasks. Most commonly you will use the first method, where you pass in a status and description text. This is written to the Postgres database, and accessible when users call the status function on their NLPQL jobs.
self.write_log_data("DONE!", "done writing sample data")
The second is less common, but may be desirable in certain cases, which is writing to the temp_file
used by Luigi, e.g.:
temp_file.write("Some pretty long message that maybe I don't want to show to users")
This is written to the file system and generally not accessible to users via APIs.
Using Custom Collectors¶
Collectors in ClarityNLP are similar to the reduce step in map-reduce jobs. They can be implemented similar to custom tasks, except their purpose is generally to summarize across all the data generated in the parallelized Luigi tasks.
To utilize the collector, extend the BaseCollector
class, and make sure your collector_name
in that class is the same as your task_name
in your custom task.
class MyCustomCollector(BaseCollector):
collector_name = 'cool_custom_stuff'
def custom_cleanup(self, pipeline_id, job, owner, pipeline_type, pipeline_config, client, db):
print('custom cleanup (optional)')
def run_custom_task(self, pipeline_id, job, owner, pipeline_type, pipeline_config, client, db):
print('run custom task collector')
# TODO write out some summary stats to mongodb
class MyCustomTask(BaseTask):
task_name = 'cool_custom_stuff'
def run_custom_task(self, temp_file, mongo_client: MongoClient):
print('run custom task')
for doc in self.docs:
# TODO write out some data to mongodb about these docs
Collectors often are not needed, but may be necessary for certain algorithm implementations.
Setting up the Python Package¶
ClarityNLP automatically discovers any classes in the custom_task
package. However, besides saving your Python file in custom_tasks
, you just need to make sure it’s included in the custom_tasks
package by adding it to nlp/custom_tasks/__init__.py
, following the example:
from .SampleTask import SampleTask
Calling Custom Algorithms from NLPQL¶
To run your custom algorithm in NLPQL, you just need to call it by name as a function like the example below, and make sure to pass in any variables needed for the config and Solr query.
define sampleTask:
Clarity.MyCustomTask({
documentset: [ProviderNotes],
"my_custom_argument": 42
});
Custom Algorithm or External Library?¶
There aren’t too many limitations on what you build inside of custom tasks and collectors, given that it’s a something that can input text, and output a Python object. This is a powerful feature that will allow you to integrate many types of capabilities into ClarityNLP!
Other Conventions¶
While the previous sections contain the main items you need to create custom task algorithms in ClarityNLP, here’s some other information that might be useful.
- Default Value: Or using
value
as the default field. In NLPQL, when no field name is specified, it will default tovalue
. This means that you may want to provide avalue
field in your resulting object that gets saved to MongoDB, so that there’s a default value - Sentences: While there’s no requirement to parse or run your algorithm at the sentence level, it is useful for scoping and user validation. Therefore, in most of the core ClarityNLP algorithms, output
sentence
is part of the result, and you may wish to follow this paradigm - Metadata: All the metadata from the job is automatically saved for you, however you may have additional metadata you want to save from your algorithm or source data