# USE Encoder (use_enc)
The `use_enc` module uses the Universal Sentence Encoder to generate sentence-level embeddings. These embeddings can then be used to calculate the similarity between two given texts via cosine similarity and/or dot product.
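As a quick illustration of how such embeddings are compared, here is a minimal Python sketch of dot product and cosine similarity between two embedding vectors. The vectors below are made-up placeholders; in practice they would come from the encoder.

```python
import numpy as np

# Placeholder embeddings; in practice these come from the USE encoder.
emb_a = np.array([0.12, -0.45, 0.33, 0.08])
emb_b = np.array([0.10, -0.40, 0.30, 0.12])

# Dot product: unnormalized similarity.
dot = float(np.dot(emb_a, emb_b))

# Cosine similarity: dot product of the L2-normalized vectors.
cosine = dot / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))

print(f"dot product: {dot:.4f}, cosine similarity: {cosine:.4f}")
```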
In this tutorial we are going to leverage the USE encoder for zero-shot text classification. We will cover:
- Preparing the dataset for evaluation
- Importing the USE Encoder (`use_enc`) module in jac
- Evaluating the model's effectiveness
## Walk through
### 1. Preparing the dataset
For this tutorial, we are going to leverage the `use_enc` module for text classification, i.e. categorizing an incoming text into one of a set of predefined classes. For demonstration purposes, we are going to use the SNIPS dataset as an example.
SNIPS is a popular intent classification dataset covering intents such as `["BookRestaurant", "ComparePlaces", "GetDirections", "GetPlaceDetails", "GetTrafficInformation", "GetWeather", "RequestRide", "SearchPlace", "ShareCurrentLocation", "ShareETA"]`.
We need to do a little data format conversion to create a version of SNIPS that works with our `use_enc` implementation. For this part, we are going to use Python.
- First, import the dataset from the Hugging Face `datasets` library:

    ```python
    # import library
    from datasets import load_dataset

    # load dataset
    dataset = load_dataset("snips_built_in_intents")
    print(dataset["train"][:2])
    ```
If the import succeeds, you should see output in a format like this:

```
{"text": ["Share my location with Hillary's sister", "Send my current location to my father"], "label": [5, 5]}
```
- Next, convert the out-of-the-box SNIPS format into the format that can be ingested by `use_enc`:

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split
    import json

    # get label names
    lab = dataset["train"].features["label"].names

    # create labels dictionary
    label_dict = {v: k for v, k in enumerate(lab)}

    # dataset
    dataset = dataset["train"]

    # create dataset function
    def CreateData(data):
        # create dataframe
        df = pd.DataFrame(data)
        # map labels dict onto the label column
        df["label"] = df["label"].apply(lambda x: label_dict[x])
        # group texts by label
        df = df.groupby("label").agg({"text": "\t".join, "label": "\t".join})
        df["label"] = df["label"].apply(lambda x: x.split("\t")[0])
        # create data dictionary
        data_dict = {}
        for i in range(len(df)):
            data_dict[df["label"][i]] = df["text"][i].split("\t")
        return data_dict

    # split dataset: create the test split, store it in `test.json`, and save to disk
    _, test = train_test_split(dataset, test_size=0.2, random_state=42)

    # create test dataset
    test_data = CreateData(test)

    data = []
    classes = []
    for itms in test_data:
        if itms not in classes:
            classes.append(itms)
        for text in test_data[itms]:
            data.append({"text": text, "class": itms})
    test_dataset = {"text": data, "classes": classes}

    # write data to json file 'test.json'
    with open("test.json", "w", encoding="utf8") as f:
        f.write(json.dumps(test_dataset, indent=4))
    ```
The resulting format should look something like this.
- `test.json`

    ```json
    {
        "text": [
            {"text": "Book a table at Galli for 6 people tonight", "class": "BookRestaurant"},
            {"text": "What's the best hotel between Soho Grand and Paramount Hotel?", "class": "ComparePlaces"},
            {"text": "Give me transit directions from Grand Central to Brooklyn bridge", "class": "GetDirections"},
            {"text": "What's today's menu at The Water Club?", "class": "GetPlaceDetails"},
            {"text": "Should I expect traffic from here to Kennedy international airport?", "class": "GetTrafficInformation"},
            {"text": "What will the weather be like tomorrow morning?", "class": "GetWeather"},
            {"text": "I need an Uber right now", "class": "RequestRide"},
            {"text": "Find me the closest theatre for tonight", "class": "SearchPlace"},
            {"text": "Share my location to mum until I get to school", "class": "ShareCurrentLocation"},
            {"text": "Send my time of arrival to Nina", "class": "ShareETA"}
        ],
        "classes": [
            "BookRestaurant",
            "ComparePlaces",
            "GetDirections",
            "GetPlaceDetails",
            "GetTrafficInformation",
            "GetWeather",
            "RequestRide",
            "SearchPlace",
            "ShareCurrentLocation",
            "ShareETA"
        ]
    }
    ```
### 2. Import the USE Encoder (use_enc) module in jac
- Open a terminal and run Jaseci with the following command:

    ```
    jsctl -m
    ```
- Load the `use_enc` module in jac with the following command:

    ```
    actions load module jac_nlp.use_enc
    ```
### 3. Evaluate the model's effectiveness
In this section, we evaluate `use_enc` on intent classification using the SNIPS test dataset, i.e. categorizing an incoming text into one of the predefined classes.
**Creating the Jac program (evaluating `use_enc`)**
- Create a file named `use_enc.jac`.
- Create the `model_dir` and `use_encoder` nodes in the `use_enc.jac` file:

    ```jac
    node model_dir;
    node use_encoder {};
    ```
- Initialize the `use_encoder` node and import the text similarity and text classification abilities inside the node:

    ```jac
    # import abilities
    can use.text_similarity, use.text_classify;
    ```
- Initialize the ability `eval_text_classification` inside the `use_encoder` node:

    ```jac
    # classify text and evaluate text classification
    can eval_text_classification with eval_text_classification entry {
        test_data = file.load_json(visitor.test_file);
        classes = test_data["classes"];
        result = [];
        for itm in test_data["text"] {
            text = itm["text"];
            class_true = itm["class"];
            resp = use.text_classify(
                text = text,
                classes = classes.list
            );
            result.list::append({"text": text, "class_true": class_true, "class_pred": resp["match"]});
        }
        fn = "result_use_enc.json";
        file.dump_json(fn, result);
    }
    ```
**Parameter details for `use.text_classify`**

- Input:
    - `text` (string): text to classify
    - `classes` (list of strings): candidate classification classes
- Output: a `dict` containing the matching class, the match index, and the score

For the evaluation we pass the test data file, e.g. `test.json`.
- Add an edge named `use_model` in the `use_enc.jac` file to connect nodes inside the graph:

    ```jac
    # adding edge
    edge use_model {
        has model_type;
    }
    ```
- Add a graph named `use_encoder_graph` to initialize the nodes:

    ```jac
    graph use_encoder_graph {
        has anchor use_model_dir;
        spawn {
            use_model_dir = spawn node::model_dir;
            use_encoder_node = spawn node::use_encoder;
            use_model_dir -[use_model(model_type="use_encoder")]-> use_encoder_node;
        }
    }
    ```
- Initialize the `init` walker to spawn the graph:

    ```jac
    walker init {
        root {
            spawn here ++> graph::use_encoder_graph;
        }
    }
    ```
- Create a walker named `eval_text_classification` that reads its parameter from context (or uses the default) and walks to the node whose `eval_text_classification` ability runs on entry:

    ```jac
    # declaring the walker
    walker eval_text_classification {
        has test_file = "test.json";

        root {
            take --> node::model_dir;
        }
        model_dir {
            take -->;
        }
    }
    ```
**Final `use_enc.jac` program**

```jac
node model_dir;
node use_encoder {
    # import module ability
    can use.text_classify;

    # evaluate text classification
    can eval_text_classification with eval_text_classification entry {
        test_data = file.load_json(visitor.test_file);
        classes = test_data["classes"];
        result = [];
        for itm in test_data["text"] {
            text = itm["text"];
            class_true = itm["class"];
            resp = use.text_classify(
                text = text,
                classes = classes.list
            );
            result.list::append({"text": text, "class_true": class_true, "class_pred": resp["match"]});
        }
        fn = "result_use_enc.json";
        file.dump_json(fn, result);
    }
}

# adding edge
edge use_model {
    has model_type;
}

# creating graph
graph use_encoder_graph {
    has anchor use_model_dir;
    spawn {
        use_model_dir = spawn node::model_dir;
        use_encoder_node = spawn node::use_encoder;
        use_model_dir -[use_model(model_type="use_encoder")]-> use_encoder_node;
    }
}

# initialize init walker
walker init {
    root {
        spawn here ++> graph::use_encoder_graph;
    }
}

# declaring the walker for calling the ability
walker eval_text_classification {
    has test_file = "test.json";

    root {
        take --> node::model_dir;
    }
    model_dir {
        take -->;
    }
}
```
**Steps for running the `use_enc.jac` program**
- Execute the following command to build `use_enc.jac`:

    ```
    jac build use_enc.jac
    ```
- Execute the following command to activate the sentinel:

    ```
    sentinel set -snt active:sentinel -mode ir use_enc.jir
    ```

    Note: If you get the error `ValueError: badly formed hexadecimal UUID string`, execute the following command once:

    ```
    sentinel register -set_active true -mode ir use_enc.jir
    ```
- Execute the walker `eval_text_classification` with its default parameters to evaluate the `use_enc` module:

    ```
    walker run eval_text_classification
    ```
After executing the walker `eval_text_classification`, the result data will be stored in the file `result_use_enc.json` in your current working directory.

**Evaluation Result**
```
Evaluation accuracy score : 0.0303
Evaluation F1_score : 0.0495

Evaluation classification_report :

                       precision    recall  f1-score   support

       BookRestaurant       0.00      0.00      0.00        17
        ComparePlaces       0.50      0.33      0.40         3
        GetDirections       0.06      0.33      0.10         3
      GetPlaceDetails       0.00      0.00      0.00        13
GetTrafficInformation       0.00      0.00      0.00         1
           GetWeather       0.00      0.00      0.00        13
          RequestRide       0.00      0.00      0.00         2
          SearchPlace       0.00      0.00      0.00         5
 ShareCurrentLocation       0.00      0.00      0.00         4
             ShareETA       0.00      0.00      0.00         5

             accuracy                           0.03        66
            macro avg       0.06      0.07      0.05        66
         weighted avg       0.03      0.03      0.02        66
```
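The summary metrics above can be reproduced from `result_use_enc.json` along the lines of the sketch below. This is a minimal sketch, assuming the file holds the list of `{"text", "class_true", "class_pred"}` records written by the walker, that scikit-learn is installed, and that the reported F1 is a macro average (consistent with the `macro avg` row above).

```python
import json
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Load the predictions written by the eval_text_classification walker.
with open("result_use_enc.json", encoding="utf8") as f:
    result = json.load(f)

y_true = [r["class_true"] for r in result]
y_pred = [r["class_pred"] for r in result]

print("Evaluation accuracy score :", round(accuracy_score(y_true, y_pred), 4))
print("Evaluation F1_score :", round(f1_score(y_true, y_pred, average="macro"), 4))
print("Evaluation classification_report :")
print(classification_report(y_true, y_pred))
```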
## Batch Encoding
**What is batch encoding?**

Batch encoding converts a whole list of texts into numerical values (embeddings) in a single call, so the results can be fed directly to a machine learning model.
In Jaseci, you implement this by importing the USE model from the Jaseci AI Kit and using the `encode` function: instead of passing a single string, you pass an array of strings to `encode`.
```jac
walker get_encoding {
    can use.encode;
    has list_of_text = ["I hate bananas", "guava is a decent fruit", "I love apples"];

    report use.encode(list_of_text);
}
```
Each encoding in the model's output corresponds, by position, to the text at the same index in the input array: the encoding of "I hate bananas" is the first element of the returned data, the encoding of "guava is a decent fruit" is the second, and so on.
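To make that positional correspondence concrete, here is a minimal Python sketch that pairs each input text with its embedding and compares them by cosine similarity. It assumes the walker's report gives you a list of float vectors, one per input text, in input order; random placeholders stand in for real USE embeddings.

```python
import numpy as np

list_of_text = ["I hate bananas", "guava is a decent fruit", "I love apples"]

# Assumption: `encodings` is the list of vectors reported by the get_encoding
# walker, one per input text, in input order. Random placeholders here.
rng = np.random.default_rng(0)
encodings = [rng.standard_normal(512) for _ in list_of_text]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# encodings[i] belongs to list_of_text[i].
for i, text in enumerate(list_of_text):
    sim = cosine(encodings[0], encodings[i])
    print(f"similarity({list_of_text[0]!r}, {text!r}) = {sim:.3f}")
```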