
Using BigQuery ML to categorise email inquiries


BigQuery ML

Problem description

We want to categorize emails in order to either generate automatic responses, or decrease the amount of manual labor needed.

The Data

We have extracted and anonymized real inquiries received by Entur's customer service centre (Kundesenter), together with the category assigned to each inquiry.

The model

To create (and train) the model we simply run:

CREATE OR REPLACE MODEL
  `entur-analytics-rtd.hackathon_2023_q1_lag4.auto_ml_all_top4`
OPTIONS (
  model_type = 'AUTOML_CLASSIFIER',
  input_label_cols = ['label'],
  budget_hours = 1
) AS
SELECT
  -- Lowercase the email, split it into words, and build 1- and 2-grams,
  -- matching the features used at prediction time.
  ML.NGRAMS(REGEXP_EXTRACT_ALL(LOWER(Email), '[a-zæøå]+'), [1, 2]) AS words,
  operation AS label
FROM
  `entur-analytics-rtd.hackathon_2023_q1_lag4.emails_top4`
WHERE
  operation IS NOT NULL

where input_label_cols specifies the column(s) we want to predict. The query creates a BigQuery ML model that we can then query for predictions on new data. For instance, we could predict the category of the email "Jeg ønsker å avbestille min billett fra Oslo til Bergen" ("I would like to cancel my ticket from Oslo to Bergen") as follows:

SELECT
  *
FROM
  ML.PREDICT(
    MODEL `entur-analytics-rtd.hackathon_2023_q1_lag4.logreg_train_top4`,
    (
      SELECT
        ML.NGRAMS(REGEXP_EXTRACT_ALL(LOWER("Jeg ønsker å avbestille min billett fra Oslo til Bergen"), '[a-zæøå]+'), [1, 2]) AS words
    )
  )

which returns the model's likelihood for each category. In this case the model correctly predicts the category "avbestilling/refusjon" (cancellation/refund) with a likelihood of 40 %.
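To make the feature extraction step more concrete, here is a minimal Python sketch of what the combination of LOWER, REGEXP_EXTRACT_ALL and ML.NGRAMS with the range [1, 2] computes for a single email. The function names `tokenize` and `ngrams` are our own, and the output ordering and the space separator are assumptions about how ML.NGRAMS behaves; this is an illustration, not the BigQuery implementation.

```python
import re

# Same character class as in the SQL queries: runs of lowercase
# letters, including the Norwegian æ, ø and å.
TOKEN_PATTERN = re.compile(r"[a-zæøå]+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract all words (cf. LOWER + REGEXP_EXTRACT_ALL)."""
    return TOKEN_PATTERN.findall(text.lower())


def ngrams(tokens: list[str], min_n: int = 1, max_n: int = 2, sep: str = " ") -> list[str]:
    """Build all n-grams with min_n <= n <= max_n (cf. ML.NGRAMS(tokens, [1, 2]))."""
    result = []
    for start in range(len(tokens)):
        for n in range(min_n, max_n + 1):
            if start + n <= len(tokens):
                result.append(sep.join(tokens[start:start + n]))
    return result


email = "Jeg ønsker å avbestille min billett fra Oslo til Bergen"
words = ngrams(tokenize(email))
# The list starts with "jeg", "jeg ønsker", "ønsker", "ønsker å", ...
print(words)
```

The model never sees the raw email text, only this list of words and word pairs, which is why the training query and the prediction query need to build their features in the same way.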

Kafka migration to VPC networks


Virtual Private Cloud (VPC) peering is a method of connecting separate private cloud networks (on AWS, Google Cloud, or Azure) with each other. It allows virtual machines in different private networks to talk to each other directly, without going through the public internet.

Aiven's VPC peering only connects private networks within the same cloud provider. This means that our Azure cloud users can only reach Kafka services that have been migrated to the Google VPC network through public URLs.

Kafka platform changes

  1. Team Data Platform and Team Platform have created the necessary VPC resources to migrate existing internal Kafka clusters to VPC networks.

  2. Kafka clusters serving external Kafka users are NOT migrated, and their usage remains the same as before:

    1. entur-kafka-test-ext
    2. entur-kafka-prod-ext
  3. This migration has no impact on Entur applications running in GKE clusters in the dev, staging and production environments.

  4. The following clusters will be migrated:

    1. entur-kafka-test-int
    2. entur-kafka-prod-int
  5. The following Kafka users are affected by this migration. For these users, public URLs have been created by adding a public- prefix to the existing bootstrap and schema registry server URLs (note that the public bootstrap servers also use a different port than the private ones).

    1. Entur applications running in other cloud networks, such as Azure
    2. CI/CD applications
    3. Entur developers

    ➡️ After the migration to VPC networks, these users must switch to the public URLs, because the existing URLs will be assigned to the private networks.

  6. The same Kafka user credentials should work as before

Kafka services URL lookup

| Cluster ID | Bootstrap Server URL | Schema Registry Server URL | Access | VPC |
| --- | --- | --- | --- | --- |
| entur-kafka-test-ext | entur-kafka-test-ext-entur-test.aivencloud.com:11877 | https://entur-kafka-test-ext-entur-test.aivencloud.com:11867 | public | no |
| entur-kafka-prod-ext | entur-kafka-prod-ext-entur-prod.aivencloud.com:14019 | https://entur-kafka-prod-ext-entur-prod.aivencloud.com:14009 | public | no |
| entur-kafka-test-int | entur-kafka-test-int-entur-test.aivencloud.com:11877 | https://entur-kafka-test-int-entur-test.aivencloud.com:11867 | private | yes |
| entur-kafka-test-int | public-entur-kafka-test-int-entur-test.aivencloud.com:11878 | https://public-entur-kafka-test-int-entur-test.aivencloud.com:11867 | public | yes |
| entur-kafka-prod-int | entur-kafka-prod-int-entur-prod.aivencloud.com:14019 | https://entur-kafka-prod-int-entur-prod.aivencloud.com:14009 | private | yes |
| entur-kafka-prod-int | public-entur-kafka-prod-int-entur-prod.aivencloud.com:14020 | https://public-entur-kafka-prod-int-entur-prod.aivencloud.com:14009 | public | yes |
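The lookup above can also be expressed in code. The following Python sketch is only an illustration of how a script could pick the right URLs: the `KafkaEndpoints` type and the `endpoints_for` function are hypothetical names of our own, and the rule "use the private URLs only from inside Entur's GKE clusters" is our reading of the migration notes. The URLs themselves are copied from the table.

```python
from typing import NamedTuple


class KafkaEndpoints(NamedTuple):
    bootstrap_server: str
    schema_registry_url: str


# (cluster ID, access) -> endpoints, copied from the lookup table.
ENDPOINTS = {
    ("entur-kafka-test-ext", "public"): KafkaEndpoints(
        "entur-kafka-test-ext-entur-test.aivencloud.com:11877",
        "https://entur-kafka-test-ext-entur-test.aivencloud.com:11867",
    ),
    ("entur-kafka-prod-ext", "public"): KafkaEndpoints(
        "entur-kafka-prod-ext-entur-prod.aivencloud.com:14019",
        "https://entur-kafka-prod-ext-entur-prod.aivencloud.com:14009",
    ),
    ("entur-kafka-test-int", "private"): KafkaEndpoints(
        "entur-kafka-test-int-entur-test.aivencloud.com:11877",
        "https://entur-kafka-test-int-entur-test.aivencloud.com:11867",
    ),
    ("entur-kafka-test-int", "public"): KafkaEndpoints(
        "public-entur-kafka-test-int-entur-test.aivencloud.com:11878",
        "https://public-entur-kafka-test-int-entur-test.aivencloud.com:11867",
    ),
    ("entur-kafka-prod-int", "private"): KafkaEndpoints(
        "entur-kafka-prod-int-entur-prod.aivencloud.com:14019",
        "https://entur-kafka-prod-int-entur-prod.aivencloud.com:14009",
    ),
    ("entur-kafka-prod-int", "public"): KafkaEndpoints(
        "public-entur-kafka-prod-int-entur-prod.aivencloud.com:14020",
        "https://public-entur-kafka-prod-int-entur-prod.aivencloud.com:14009",
    ),
}


def endpoints_for(cluster_id: str, inside_gke: bool) -> KafkaEndpoints:
    """Return the private URLs inside GKE when they exist, else the public ones.

    Raises KeyError for a cluster ID that is not in the table.
    """
    if inside_gke and (cluster_id, "private") in ENDPOINTS:
        return ENDPOINTS[(cluster_id, "private")]
    return ENDPOINTS[(cluster_id, "public")]


# A developer laptop or CI/CD job is outside GKE, so it gets the public URLs.
print(endpoints_for("entur-kafka-test-int", inside_gke=False).bootstrap_server)
```

Keeping the table as data rather than deriving the public URLs from the private ones avoids getting the bootstrap port wrong, since only the host name gains the public- prefix while the port changes as well.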

Changes in usage with entur-kafka-spring-starter library

The default configuration does not change as long as your application accesses one of the following:

  1. External Kafka clusters, which are not migrated to VPC networks
  2. Internal Kafka clusters accessed from Entur's GKE clusters

Here is an example of the configuration change needed to access the entur-kafka-test-int cluster through its public URLs with the entur-kafka-spring-starter library.

entur:
  kafka:
    bootstrapServer: "public-entur-kafka-test-int-entur-test.aivencloud.com:11878"
    schemaRegistryUrl: "https://public-entur-kafka-test-int-entur-test.aivencloud.com:11867"
    schemaRegistryBasicAuth: "${KAFKA_USER_NAME}:${KAFKA_USER_PASSWORD}"
    sasl:
      username: "${KAFKA_USER_NAME}"
      password: "${KAFKA_USER_PASSWORD}"
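For users outside Spring, such as a CI/CD script, the same credentials can be tried against the public schema registry URL. The sketch below is a hedged illustration: it assumes that the registry accepts standard HTTP Basic authentication in the same user:password form as schemaRegistryBasicAuth, and that it exposes the `/subjects` endpoint of the Schema Registry REST API. The helper names `basic_auth_header` and `build_subjects_request` are our own.

```python
import base64
import urllib.request

# Public schema registry URL for entur-kafka-test-int, from the configuration above.
SCHEMA_REGISTRY_URL = "https://public-entur-kafka-test-int-entur-test.aivencloud.com:11867"


def basic_auth_header(username: str, password: str) -> str:
    """Encode "username:password" as an HTTP Basic Authorization header value."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_subjects_request(base_url: str, username: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) a GET request that lists the registered subjects."""
    request = urllib.request.Request(f"{base_url}/subjects")
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


# To actually send the request, read the credentials from the environment
# (KAFKA_USER_NAME and KAFKA_USER_PASSWORD, as in the YAML above) and call:
#
#   request = build_subjects_request(SCHEMA_REGISTRY_URL, user, password)
#   with urllib.request.urlopen(request, timeout=10) as response:
#       print(response.status)
```

A successful response from outside Entur's GKE clusters would indicate that both the public URL and the existing Kafka user credentials work as expected.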

Migration strategy

  1. All Kafka applications that connect from outside Entur's GKE clusters can start using the public URLs:
    1. Azure Cloud users
    2. CI/CD pipeline users
    3. Entur developers
  2. The entur-kafka-test-int cluster will be migrated to the VPC network.
  3. The entur-kafka-prod-int cluster will be migrated to the VPC network last.