Augmented Classification of text with Watson Natural Language Understanding and IBM Data Science experience
Read this in other languages: 한국어.
In this developer journey we will use Jupyter notebooks in IBM Data Science experience(DSX) to augment IBM Watson Natural Language Understanding API output through configurable mechanism for text classification.
When the reader has completed this journey, they will understand how to:
- Create and run a Jupyter notebook in DSX.
- Use DSX Object Storage to access data and configuration files.
- Use IBM Watson Natural Language Understanding API to extract metadata from documents in Jupyter notebooks.
- Extract and format unstructured data using simplified Python functions.
- Use a configuration file to build configurable and layered classification grammar.
- Use the combination of grammatical classification and regex patterns from a configuration file to classify word token classes.
- Store the processed output JSON in DSX Object Storage.
The intended audience for this journey is developers who want to learn a method for augumenting classification metadata obtained from Watson Natural Language Understanding API, in situations when there is a scarcity of historical data. The traditional approach of training a Text Analytics model yields less than expected results. The distinguishing factor of this journey is that it allows a configurable mechanism of text classification. It helps give a developer a head start in the case of text from a specialized domain, with no generally available English parser.
-
IBM Data Science Experience: Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.
-
Bluemix Object Storage: A Bluemix service that provides an unstructured cloud data store to build and deliver cost effective apps and services with high reliability and fast speed to market.
-
Watson Natural Language Understanding: A Bluemix service that can analyze text to extract meta-data from content such as concepts, entities, keywords, categories, sentiment, emotion, relations, semantic roles, using natural language understanding.
- Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
Follow these steps to setup and run this developer journey. The steps are described in detail below.
- Sign up for the Data Science Experience
- Create Bluemix services
- Create the notebook
- Add the data and configuraton file
- Update the notebook with service credentials
- Run the notebook
- Download the results
- Analyze the results
Sign up for IBM's Data Science Experience. By signing up for the Data Science Experience, two services: DSX-Spark
and DSX-ObjectStore
will be created in your Bluemix account.
Create the following Bluemix service and name it wdc-NLU-service:
Use the menu on the top to select Projects
and then Default Project
.
Click on Add notebooks
(upper right) to create a notebook.
- Select the
From URL
tab. - Enter a name for the notebook.
- Optionally, enter a description for the notebook.
- Enter this Notebook URL: https://github.com/IBM/watson-document-classifier/blob/master/notebooks/watson_document_classifier.ipynb
- Click the
Create Notebook
button.
- From the
My Projects > Default
page, UseFind and Add Data
(look for the10/01
icon) and itsFiles
tab. - Click
browse
and navigate to this repowatson-document-classifier/data/sample_text.txt
- Click
browse
and navigate to this repowatson-document-classifier/configuration/sample_config.txt
Note: It is possible to use your own data and configuration files. If you use a configuration file from your computer, make sure to conform to the JSON structure given in
configuration/sample_config.txt
.
If you use your own data and configuration files, you will need to update the variables that refer to the data and configuration files in the Jupyter Notebook.
In the notebook, update the global variables in the cell following 2.3 Global Variables
section.
Replace the sampleTextFileName
with the name of your data file and sampleConfigFileName
with your configuration file name.
Select the cell below 2.1 Add your service credentials from Bluemix for the Watson services
section in the notebook to update the credentials for Watson Natural Langauage Understanding.
Open the Watson Natural Language Understanding service in your Bluemix Dashboard and click on your service, which you should have named wdc-NLU-service
.
Once the service is open click the Service Credentials
menu on the left.
In the Service Credentials
that opens up in the UI, select whichever Credentials
you would like to use in the notebook from the KEY NAME
column. Click View credentials
and copy username
and password
key values that appear on the UI in JSON format.
Update the username
and password
key values in the cell below 2.1 Add your service credentials from Bluemix for the Watson services
section.
-
Select the cell below
2.2 Add your service credentials for Object Storage
section in the notebook to update the credentials for Object Store. -
Delete the contents of the cell
-
Use
Find and Add Data
(look for the10/01
icon) and itsFiles
tab. You should see the file names uploaded earlier. Make sure your active cell is the empty one below2.2 Add...
-
Select
Insert to code
(below your sample_text.txt). -
Click
Insert Crendentials
from drop down menu. -
Make sure the credentials are saved as
credentials_1
.
When a notebook is executed, what is actually happening is that each code cell in the notebook is executed, in order, from top to bottom.
IMPORTANT: The first time you run your notebook, you will need to install the necessary packages in section 1.1 and then
Restart the kernel
.
Each code cell is selectable and is preceded by a tag in the left margin. The tag
format is In [x]:
. Depending on the state of the notebook, the x
can be:
- A blank, this indicates that the cell has never been executed.
- A number, this number represents the relative order this code step was executed.
- A
*
, this indicates that the cell is currently executing.
There are several ways to execute the code cells in your notebook:
- One cell at a time.
- Select the cell, and then press the
Play
button in the toolbar.
- Select the cell, and then press the
- Batch mode, in sequential order.
- From the
Cell
menu bar, there are several options available. For example, you canRun All
cells in your notebook, or you canRun All Below
, that will start executing from the first cell under the currently selected cell, and then continue executing all cells that follow.
- From the
- At a scheduled time.
- Press the
Schedule
button located in the top right section of your notebook panel. Here you can schedule your notebook to be executed once at some future time, or repeatedly at your specified interval.
- Press the
- To see the results, go to DSX-ObjectStore
- Click on the name of your object storage
- Click on the Container with the name you gave your Notebook
- Select
sample_text_classification.txt
file using select box to the left of the file listing - Click the
SelectAction
button and use theDownload File
drop down menu to downloadsample_text_classification.txt
file.
After running each cell of the notebook under Classify text, the results will display.
The configuration json controls the way the text is classified. The classification process is divided into stages - Base Tagging and Domain Tagging. The Base Tagging stage can be used to specify keywords based classification, regular expression based classification, and tagging based on chunking expressions. The Domain Tagging stage can be used to specify classification that is specific to the domain, in order to augment the results from Watson Natural Language Understanding.
We can modify the configuration json to add more keywords or add regular expressions. In this way, we can augment the text classification without any changes to the code. We can add more stages to the configuration json if required and enhance the text classification results with code modifications.
It can be seen from the classification results that the keywords and regular expressions specified in the configuration have been correctly classified in the analyzed text that is displayed.