Unlocking Insights: TelBench's Role in Advancing Telecommunications
Introduction
Benchmarking the Telco LLM, which is tuned on carefully designed data, is a key element of its development. SKT's AI Tech Collaboration Group measures the performance of LLMs on a battery of tasks, ranging from general tasks that measure the reasoning and language ability of general models to telco-specific tasks that measure the ability to perform work specialized to the telco domain. SKT’s team of fantastic linguists designed the tasks and the benchmark data.
Benchmarking is performed at regular intervals, and from each round of benchmarking, we are able to glean key insights into each LLM. When we benchmark, we can closely examine areas of strength for an LLM, as well as areas that need improvement. We can also objectively measure performance of an LLM by comparing it with other models. These carefully designed benchmarks take into account not only LLM capability but also business perspectives. The business perspective is particularly important, as it’s a measure of how useful and effective the LLM will be when deployed in an actual business use case.
Benchmark Task Design
General Tasks
LLMs have a diverse set of capabilities that lead to good performance on general language skills. Ensuring that the LLM performs well on general tasks is an important part of benchmarking, as this should be an area of strength for LLMs. As such, we created benchmarks to test LLMs’ basic abilities in the Korean language, including reading comprehension and inferring meaning from context. These tasks are important when benchmarking an LLM, but as a Telco, we put more effort into designing Telco-specific tasks.
Telco-specific Tasks
We designed Telco-specific tasks to evaluate a model’s expertise and knowledge of the Telco domain. SKT’s team of linguistic experts designed each task meticulously to ensure quality and reproducibility. The data for the tasks comes from Telco service data, meaning that it reflects actual customer requests and inquiries. As a result, the data is useful not only for evaluating model performance but also for gleaning insights into performance on relevant business cases.
Target Models
Base Models: In order to objectively examine the performance of the Telco LLM, the performance of general LLMs, such as GPT-4, must also be examined. Comparing the Telco LLM with general LLMs allows us to determine how much fine-tuning improves performance and to highlight areas for improvement.
Telco LLMs: Telco LLMs are general LLMs that are fine-tuned with Telco-specific data. There are currently two: TelClaude, which is based on Anthropic's Claude model, and TelGPT, which is tuned on OpenAI's GPT model. The tuning process creates several candidate models with different data mixes and/or model sizes, and SKT quickly evaluates these candidates to find the best-performing model.
Benchmarking
Benchmarking has 3 steps:
1. Automated evaluation based on metrics like F1
2. Detailed analysis of results based on features, such as class or topic
3. Human evaluation
Customer service agents perform the human evaluation, giving us keen insights from domain experts. These insights help drive improvements of tuning and benchmarking data, which subsequently improve the next iterations of model tuning.
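To make the first two steps concrete, here is a minimal sketch of a per-class F1 computation for a single-label task such as intent classification. It is an illustration under assumptions, not SKT's actual evaluation code: the function names `per_class_f1` and `macro_f1` and the tiny sample data are hypothetical, and macro averaging is only one of several reasonable ways to aggregate.

```python
from collections import Counter

def per_class_f1(labels, predictions):
    """Return {class: F1} for single-label classification."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(labels, predictions):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[gold] += 1   # the gold class gains a false negative
    scores = {}
    for cls in set(labels) | set(predictions):
        denom = 2 * tp[cls] + fp[cls] + fn[cls]
        scores[cls] = 2 * tp[cls] / denom if denom else 0.0
    return scores

def macro_f1(labels, predictions):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(labels, predictions)
    return sum(scores.values()) / len(scores)

# Hypothetical sample: four utterances, one of them misclassified.
labels = ["Cancellation.ServicePause", "Cancellation.ServicePause",
          "Ask.ServicePause", "Check.RoamingPlan"]
predictions = ["Ask.ServicePause", "Cancellation.ServicePause",
               "Ask.ServicePause", "Check.RoamingPlan"]

by_class = per_class_f1(labels, predictions)   # step 2: inspect per class
print(round(macro_f1(labels, predictions), 3))  # 0.778
```

The per-class dictionary is what feeds the detailed analysis step: classes with low F1 are the ones worth examining by hand.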
Detailed Result Analysis
By bucketing the low-scoring items from the automatic evaluation, we can discover several types of error cases. Here are a few examples from the intent evaluation results:
Case 1. Label: Cancellation.ServicePause → model prediction: Ask.ServicePause
['How to remove suspension of service', 'I want to know how to remove suspension', 'Tell me how to remove suspension']
The Telco LLM confused the above cases based on the similarity (in Korean) of pausing service and removing a suspension of service. Based on the frequency of the confusion, we determined that we needed to augment intent data further and provide more specific examples for each intent.
Case 2. Label: Check.RoamingPlan → model prediction: Ask.RoamingAnnouncement
['Roaming for a business trip to China', 'Roaming for a 3-day business trip']
This case also confuses the Telco LLM because the two intents are similar. Instead of predicting the correct ‘roaming fee info’ intent, the Telco LLM predicted the name of the roaming service. To resolve this issue, we decided that we needed to augment the intent data for each intent here as well.
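The kind of bucketing used in this analysis can be sketched as follows. This is a minimal illustration, not the team's tooling: the function name `bucket_errors` and the record layout (utterance, gold label, prediction) are assumptions, and the one correctly classified record is made up to show that correct predictions are filtered out.

```python
from collections import defaultdict

def bucket_errors(records):
    """Group misclassified utterances by (gold label, predicted label).

    Returns a list of ((gold, pred), [utterances]) pairs,
    most frequent confusion first.
    """
    buckets = defaultdict(list)
    for utterance, gold, pred in records:
        if gold != pred:
            buckets[(gold, pred)].append(utterance)
    return sorted(buckets.items(), key=lambda kv: len(kv[1]), reverse=True)

records = [
    ("How to remove suspension of service", "Cancellation.ServicePause", "Ask.ServicePause"),
    ("I want to know how to remove suspension", "Cancellation.ServicePause", "Ask.ServicePause"),
    ("Tell me how to remove suspension", "Cancellation.ServicePause", "Ask.ServicePause"),
    ("Roaming for a business trip to China", "Check.RoamingPlan", "Ask.RoamingAnnouncement"),
    ("Roaming for a 3-day business trip", "Check.RoamingPlan", "Ask.RoamingAnnouncement"),
    # Hypothetical correct prediction, included only to show it is skipped.
    ("Pause my service for a month", "Cancellation.ServicePause", "Cancellation.ServicePause"),
]

confusions = bucket_errors(records)
for (gold, pred), utterances in confusions:
    print(f"{gold} -> {pred}: {len(utterances)} cases")
```

Sorting by bucket size puts the most frequent confusions at the top, which is where augmenting the intent data is likely to pay off first.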
Human Evaluation
Detailed analysis of automatic evaluation results provides great insight into patterns of errors. However, to improve the model on difficult cases and meet the expectations of end users, human evaluation is necessary. As such, after tuning the Telco LLM, we make sure to get human feedback for conversation-based tasks, such as summary, todo, and topic. In order to get the most relevant feedback, we’ve employed SKT customer service representatives, who are domain experts. Through the human feedback results, we are able to find areas to refine the data and aspects we should focus on in the next round of tuning. Here are some examples where the representatives provided interesting feedback.
Todos
Representatives take notes on actions that cannot be handled during the call and have to be handled on call completion or at a later date.
The main objective is for the Telco LLM to learn to take such notes, so that the representative does not have to do so manually. Such actions are typically delineated by the future tense, but this is not always the case. From feedback from the representatives, we have been able to determine the desired form and information for relevant action items in conversations. These include:
MMS: An MMS message must be sent to process service applications, since the application includes documents and the application URL
Call: It is necessary to make a call
- To get consent from a legal representative, a card holder, or the service owner
- To confirm additional policy-related information
- To transfer the customer to another department / representative
The main type of todo is sending a text message (or MMS), but thanks to feedback from the customer service representatives, we also learned about the cases requiring a phone call. This helped us augment our tuning data set and improve overall performance.
Example
[Counselor]: Hello. This is [Jang Yoon-ji], the person in charge of name changes at SK Telecom. Are you the customer [Kim Geon-woo]?
[Customer]: Yes.
[Counselor]: Your spouse [Oh Min-jeong] requested to change the name on the Internet and landline phone used in your name. Do you agree to proceed?
[Customer]: Yes, I agree.
[Counselor]: Please tell us your date of birth to verify your identity.
[Customer]: [February 12, 1995].
[Counselor]: Thank you. Since you sent me your ID card, I will skip authentication. Due to a change in name between family members, the fees up to the previous day will be carried over as is. For TBs that are combined as a family due to a change in name, we will briefly release the free service for the entire family, change the name, and then combine them again. In the unlikely event of non-payment of the current month's bill, the combined bill will be billed in the following month. I will call [Oh Min-jeong] and once the name change has been completed, I will send a completion text message to [Kim Geon-woo]'s number.
[Customer]: Okay.
[Counselor]: Stay healthy. This was [Jang Yoon-ji]. Thank you.

todo
- Process the name change after calling another customer
- Send text message: Send a message to the customer indicating the name change has been completed.
Topics
A good topic is relevant to the conversation and is representative of the key content in the conversation. As such, we designed the topic tuning data and benchmark data to have clear and concise topics drawn from contact center conversations, with a focus on technical terms that are relevant to the Telco domain. This design choice mostly worked well, but the representatives mentioned that they want to get an overarching feel of the conversation from the topics and requested topics related to general Telco services.
In conversations, customers often don’t know the exact name of plans or services and refer to plans or services with specific features, such as price (e.g., 69,000 won plan) or amount of data (1GB plan). The contact center agents requested the inclusion of such topics, as they felt they could guess the product or service name from the specific figures.
The representatives also requested topics related to business processes, as they felt that such topics would help them categorize the type of call. These topics include “actions” like “inquiry” or “confirmation.”
Example
[Counselor]: Nice to meet you. If you have any questions, please let us know. My name is Jeong Hyeon.
[Customer]: Hello. I would like to apply for roaming.
[Counselor]: What country are you going to and what is your overall itinerary?
[Customer]: I'm going to China. I'm taking a flight at 16:00 today and returning on Wednesday.
[Counselor]: Then, will you also use data such as KakaoTalk?
[Customer]: Yes, that's right.
[Counselor]: The data plan is available for a basic fee of 9,900 won per day.
[Customer]: Yes, I understand.
[Counselor]: What time do you need to use it on Wednesday?
[Customer]: I think I will use it until 9 AM on Wednesday.
[Counselor]: Then, you can use it for two days, from 4 PM today to 4 PM Wednesday. Then, you can use it for a total of 19,800 won, and you can quickly use up to 300 megabytes of data per day. After that, speeds will be slower, but data will still be available.
[Customer]: Yes, I understand.
[Counselor]: Then, I will apply for the 9,900 won two-day plan for a total of 19,800 won. It will be active for two days starting at 4 PM today in Korean time. You will be charged the full amount for one-time data use, and text messages will be available for free after starting your plan. If you use the T phone app we provide, making or receiving calls to and from Korea is free. Please note that when making or receiving Korean calls, the word 'Baro' must change to blue to use free calls.
[Customer]: Yes, I understand.
[Counselor]: I sent you a text message with related information. Is there anything else you need?
[Customer]: None. It can be used for two days until 4 PM on Wednesday, costs 19,800 won, 300 MB per day, and the speed slows down after that. Thank you.
[Counselor]: Yes, that’s right. Please note that it ends at 4 PM Korean time. Thank you. This was Jeong Hyeon.

topic
- Roaming, China roaming, roaming data plan application, 9,900 won roaming plan application
Next Steps
The Telco LLM is in the early stages of development, so we have focused on tuning and benchmarking tasks that reflect Telco domain knowledge and Telco business practices. Going forward, we will continue to tune and benchmark the model to spur further progress. Human evaluation also provides invaluable insights that help us improve both tuning and benchmark data. As we repeat quantitative and qualitative benchmarking, the Telco LLM will continue to improve by leaps and bounds.
In the next round of benchmarking, we plan to introduce new evaluation tasks, including RAG and Planning. Both of these capabilities are key to providing excellent customer service and resolving customer issues.
We also plan to look at the benchmarking process in more detail. The number of shots (5) is a bit arbitrary, and the method of “shot” selection can also be improved. The focus should not be on optimizing the performance of base models, but it’s important to establish strong baselines.
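For readers unfamiliar with the term, a "shot" is a worked example placed in the prompt ahead of the item being evaluated. The sketch below builds a k-shot intent prompt with seeded random selection. It is a hypothetical illustration: the function name `build_few_shot_prompt`, the `Utterance:` / `Intent:` template, and the random selection strategy are all assumptions, not the procedure TelBench actually uses.

```python
import random

def build_few_shot_prompt(pool, query, k=5, seed=0):
    """Build a k-shot intent-classification prompt.

    pool  -- list of (utterance, intent) pairs to draw shots from
    query -- the utterance the model should classify
    k     -- number of shots (capped at the pool size)
    seed  -- fixed seed so every model sees the same shots
    """
    rng = random.Random(seed)
    shots = rng.sample(pool, min(k, len(pool)))
    blocks = [f"Utterance: {u}\nIntent: {i}\n" for u, i in shots]
    blocks.append(f"Utterance: {query}\nIntent:")
    return "\n".join(blocks)

# Shot pool reusing the labeled utterances quoted earlier in this article.
pool = [
    ("How to remove suspension of service", "Cancellation.ServicePause"),
    ("I want to know how to remove suspension", "Cancellation.ServicePause"),
    ("Roaming for a business trip to China", "Check.RoamingPlan"),
    ("Roaming for a 3-day business trip", "Check.RoamingPlan"),
]

prompt = build_few_shot_prompt(pool, "Tell me how to remove suspension", k=3)
print(prompt)
```

Fixing the seed keeps comparisons across models fair, since each one is shown identical shots. Replacing `rng.sample` with a selection that balances intents or retrieves the most similar utterances would be one way to explore better shot selection.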
Data is the lifeblood of LLMs, and taking a data-centric approach enables us to iteratively improve and tweak performance. However, without proper benchmarks and metrics, it is impossible to determine if the model is improving in meaningful ways. TelBench is SKT’s answer to the need for a benchmark set for the Telco industry.