Topic Extraction Model (topic_ext)
Module topic_ext extracts topics from a set of documents. It accepts as input a set of documents and their associated class labels. You can use the Jaseci cluster module to group the documents into similar clusters first. The following example code demonstrates the topic_ext module.
Walk through
1. Import the topic extraction (topic_ext) module in jac
- To start Jaseci, open a terminal and run the following command.
jsctl -m
- Load the topic_ext module in the jac shell session.
actions load module jac_nlp.topic_ext
- Load the supplementary modules in the jac shell session.
actions load module jac_nlp.use_enc
actions load module jac_misc.cluster
2. Prepare text for clusters
In this section, we take raw text as input, encode it, cluster it, and then output a list of cluster labels, one for each text document.
1. Load the text data
Save the text data in json format.
walker init{
    can file.load_json;
    has text = file.load_json("text_data.json");
}
2. Create embeddings
Use the use_enc module to encode the documents.
walker init{
    can file.load_json;
    has text = file.load_json("text_data.json");
    can use.encode;
    has encode = use.encode(visitor.text);
}
3. Reduce dimensions using umap
Use the cluster.get_umap action to reduce the number of features.
walker init{
    can file.load_json;
    has text = file.load_json("text_data.json");
    can use.encode;
    has encode = use.encode(visitor.text);
    can cluster.get_umap;
    final_features = cluster.get_umap(encode,2);
}
4. Cluster documents and get document labels
Use the cluster.get_cluster_labels action to get a cluster label for each document.
walker init{
    can file.load_json;
    has text = file.load_json("text_data.json");
    can use.encode;
    has encode = use.encode(visitor.text);
    can cluster.get_umap;
    final_features = cluster.get_umap(encode,2);
    can cluster.get_cluster_labels;
    labels = cluster.get_cluster_labels(final_features,"hbdscan",2,2);
}
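The labels list lines up with the input documents by position: the first label belongs to the first document, and so on. As a minimal Python sketch of that relationship (outside Jac, with hypothetical label values chosen only for illustration), the helper below groups documents by their label. The name group_by_label is an assumption of this example and is not part of any Jaseci module.

```python
from collections import defaultdict


def group_by_label(texts, labels):
    """Group documents by cluster label, keeping the input order."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[label].append(text)
    return dict(groups)


# Four of the sample documents with hypothetical labels (not real model output).
texts = [
    "still waiting card",
    "countries supporting",
    "send new card",
    "provide support countries",
]
labels = [2, 0, 3, 0]
clusters = group_by_label(texts, labels)
print(clusters[0])
```

Grouping like this is a quick way to check by eye whether the clusters make sense before extracting topics from them.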
5. Extract topics for each cluster
Use the topic_ext.topic_extraction action to extract the top n topics from each cluster.
Parameters of topic_ext.topic_extraction
- texts: (list of strings) list of input text documents.
- labels: (list of int) list of labels, one for each text document (the example below passes this list as the classes keyword argument).
- n_topics: (int) Default 5. Number of topics to extract from each cluster.
- min_tokens: (int) Default 1. Minimum number of words per topic.
- max_tokens: (int) Default 2. Maximum number of words per topic.
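To make min_tokens and max_tokens concrete, the following Python sketch lists every phrase of between min_tokens and max_tokens consecutive words in a document. It only illustrates what those bounds mean. How topic_ext builds and scores its candidates internally is not described here, and the function name candidate_phrases is hypothetical.

```python
def candidate_phrases(text, min_tokens=1, max_tokens=2):
    """Return every run of min_tokens..max_tokens consecutive words in text."""
    words = text.split()
    phrases = []
    for size in range(min_tokens, max_tokens + 1):
        # Slide a window of `size` words across the document.
        for start in range(len(words) - size + 1):
            phrases.append(" ".join(words[start:start + size]))
    return phrases


print(candidate_phrases("track card delivery"))
# ['track', 'card', 'delivery', 'track card', 'card delivery']
```

With the defaults (1 and 2), a topic can be a single word or a two-word phrase.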
walker init{
    can file.load_json;
    has text = file.load_json("text_data.json");
    can use.encode;
    has encode = use.encode(visitor.text);
    can cluster.get_umap;
    final_features = cluster.get_umap(encode,2);
    can cluster.get_cluster_labels;
    labels = cluster.get_cluster_labels(final_features,"hbdscan",2,2);
    can topic_ext.topic_extraction;
    topic_dict = topic_ext.topic_extraction(texts=text,classes=labels,n_topics=5);
    // report the result so that it appears in the walker output
    report topic_dict;
}
Save the full code above in a file named topic_extraction.jac, and save the following text data in the same directory as text_data.json.
[
"still waiting card",
"countries supporting",
"card still arrived weeks",
"countries accounts suppor",
"provide support countries",
"waiting week card still coming",
"track card process delivery",
"countries getting support",
"know get card lost",
"send new card",
"still received new card",
"info card delivery",
"new card still come",
"way track delivery card",
"countries currently support"
]
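The walker expects text_data.json to hold a flat list of strings. As an optional check before running it, a small Python sketch like the one below can confirm the file has that shape. The helper name load_documents is hypothetical and is not part of Jaseci.

```python
import json


def load_documents(path):
    """Load a JSON file and check that it is a non-empty list of strings."""
    with open(path, encoding="utf-8") as handle:
        documents = json.load(handle)
    if not isinstance(documents, list) or not documents:
        raise ValueError("expected a non-empty JSON list")
    if not all(isinstance(doc, str) for doc in documents):
        raise ValueError("every entry must be a string")
    return documents
```

Calling load_documents("text_data.json") from the directory that holds the file returns the list of documents, or raises ValueError if the structure is wrong.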
Run the jac code in the terminal with the command jac run topic_extraction.jac. You will see output like the following:
{
  "success": true,
  "report": [
    {
      "0": [
        ["countries", 0.392361531667182],
        ["support", 0.34487955266445003],
        ["accounts", 0.1934321572215864],
        ["supporting", 0.1934321572215864],
        ["suppor", 0.1934321572215864]
      ],
      "1": [
        ["delivery", 0.43893761248202734],
        ["track", 0.36634600373495724],
        ["way", 0.24618638191838274],
        ["process", 0.24618638191838274],
        ["info", 0.24618638191838274]
      ],
      "2": [
        ["waiting", 0.3358171700903774],
        ["weeks", 0.22567085009185084],
        ["know", 0.22567085009185084],
        ["arrived", 0.22567085009185084],
        ["coming", 0.22567085009185084]
      ],
      "3": [
        ["new", 0.5364793041447],
        ["come", 0.30089446678913445],
        ["send", 0.30089446678913445],
        ["received", 0.30089446678913445],
        ["card", 0.13515503603605478]
      ]
    }
  ],
  "final_node": "urn:uuid:65d3bfac-c6d5-475c-8a18-3a221b507a4f",
  "yielded": false
}
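The report maps each cluster id to a list of [topic, score] pairs. As a sketch of how that structure could be consumed from Python, the helper below keeps the k highest-scoring topics per cluster. The name top_topics is hypothetical, and the sample is an abbreviated copy of the output above.

```python
def top_topics(result, k=2):
    """Map each cluster id to its k highest-scoring topic strings."""
    summary = {}
    for cluster_topics in result["report"]:
        for cluster_id, scored in cluster_topics.items():
            # Each entry is a [topic, score] pair; sort by score, highest first.
            ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
            summary[cluster_id] = [topic for topic, _score in ranked[:k]]
    return summary


# Abbreviated copy of the output above: two clusters, three topics each.
result = {
    "success": True,
    "report": [
        {
            "0": [
                ["countries", 0.392361531667182],
                ["support", 0.34487955266445003],
                ["accounts", 0.1934321572215864],
            ],
            "1": [
                ["delivery", 0.43893761248202734],
                ["track", 0.36634600373495724],
                ["way", 0.24618638191838274],
            ],
        }
    ],
}
print(top_topics(result))
# {'0': ['countries', 'support'], '1': ['delivery', 'track']}
```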
6. Generate headings for each cluster
Use the topic_ext.headline_generation action to generate a relevant heading for each cluster.
Parameters of topic_ext.headline_generation
- texts: (list of strings) list of input text documents.
- labels: (list of int) list of labels, one for each text document (the example below passes this list as the classes keyword argument).
walker init{
    can file.load_json;
    has text = file.load_json("text_data.json");
    can use.encode;
    has encode = use.encode(visitor.text);
    can cluster.get_umap;
    final_features = cluster.get_umap(encode,2);
    can cluster.get_cluster_labels;
    labels = cluster.get_cluster_labels(final_features,"hbdscan",2,2);
    can topic_ext.headline_generation;
    topic_dict = topic_ext.headline_generation(texts=text,classes=labels);
    // report the result so that it appears in the walker output
    report topic_dict;
}
Expected output:
{
  "success": true,
  "report": [
    {
      "0": "Countries Account Suppor",
      "1": "Track Card Delivery: Track Card Processing Delivery",
      "2": "Waiting Week Card Still Coming",
      "3": "Send New Card to New Address"
    }
  ],
  "final_node": "urn:uuid:e83f23f5-73d2-42a0-bebf-e87590e0db6e",
  "yielded": false
}
7. Extract keywords from a single document
Use the topic_ext.keyword_extraction action to extract keywords from a single document.
Parameters of topic_ext.keyword_extraction
- sentence: (string) the input text document, as in the example below.
- n_words: (int) number of words or phrases to extract from the document.
- min_tokens: (int) Default 1. Minimum number of words per keyword.
- max_tokens: (int) Default 1. Maximum number of words per keyword.
- diversity: (float) Default 0.02. The expected level of diversity. Increasing the diversity value reduces the number of words with similar meanings in the resulting word list, and vice versa.
walker init{
    can topic_ext.keyword_extraction;
    has texts = "I like dogs";
    report topic_ext.keyword_extraction(sentence=texts, n_words=2, min_tokens = 1, max_tokens = 1, diversity = 0.02);
}
Expected output:
{
  "success": true,
  "report": [
    [
      "dogs",
      "like"
    ]
  ],
  "final_node": "urn:uuid:f91997c7-7d54-44ef-b238-2a8ea2f94418",
  "yielded": false
}
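One common way to trade relevance against diversity when picking keywords is maximal marginal relevance (MMR). Whether topic_ext uses MMR internally is an assumption of this example, not something stated above. The Python sketch below uses small, made-up two-dimensional vectors in place of real embeddings to show how a larger diversity value pushes near-duplicate words out of the result.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def mmr(doc_vec, candidates, n_words, diversity):
    """Pick n_words keys from candidates (word -> vector) by MMR."""
    if n_words <= 0 or not candidates:
        return []
    relevance = {w: cosine(doc_vec, v) for w, v in candidates.items()}
    # Start with the word most similar to the document.
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(n_words, len(candidates)):
        best_word, best_score = None, -math.inf
        for word, vec in candidates.items():
            if word in selected:
                continue
            # Penalise words that are close to something already chosen.
            redundancy = max(cosine(vec, candidates[s]) for s in selected)
            score = (1 - diversity) * relevance[word] - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
    return selected


# Made-up vectors for illustration only; real embeddings have many dimensions.
doc_vec = (1.0, 0.4)
candidates = {"dogs": (1.0, 0.2), "puppies": (1.0, 0.1), "like": (0.2, 1.0)}
print(mmr(doc_vec, candidates, n_words=2, diversity=0.02))  # ['dogs', 'puppies']
print(mmr(doc_vec, candidates, n_words=2, diversity=0.7))   # ['dogs', 'like']
```

With a low diversity value the two most relevant words win even though they are near-synonyms. With a high value the second pick moves to a word that differs more from the first.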