Here’s a truth about strings:
Most of the time, we do not use them as is.
We always do something with it. We ensure that business rules are being followed.
Let’s change the capitalization, or let’s always check that there no excess spaces.
When we’re dealing with thousands, or even millions of rows, we do not change them one by one.
We will always process them in one go, or at least that’s the most logical way!
I wrote this article as a collection of the most common substring and string manipulations in Google BigQuery.
I also added functions that we may have used in other databases and adjusted them for Google BigQuery.
This article is full of images, practical examples, and explanations for each function, so you get the most value out of it, whether you are a beginner in SQL, or you are transitioning into Google BigQuery.
Related: Data Types in Google BigQuery
What is a Substring?
When it comes to analyzing a text, sometimes we only want a portion of it.
For example, if we have a full name that includes the first and last names, we can get the first names only.
If we have an email address, we can get the email domain only (the text after the @ symbol).
In SQL or Data Analysis, a getting the substring is the process of extracting, or getting a part of a string.
There are many ways to get a substring, depending on what we want to get.
Do we want the first 15 characters only, or the text after the dollar sign?
We explain how to achieve all of these through examples in this article.
BigQuery Substring Function
In BigQuery, we use the function
SUBSTR to get the substring.
SUBSTR (text, start_point) SUBSTR (text, start_point, length_of_substring)
There are two ways to extract a substring as we see above.
When counting characters in substrings. the starting point is always
BigQuery Substring Examples
Substring Formula #1
In the first formula, we can specify a starting point, and the substring function will get the text from that starting point all the way to end.
For example, this query tells us to get the substring from position 9 onwards.
This is best used for strings which have prefixes that we want to remove.
Example: If all your products have a fixed-length code at the beginning, and you want to remove all of them, you can use this function.
|Product Code||Removing the Parent Code|
Substring Formula #2
Alternatively, the second formula allows us to provide a length for the substring.
This query tells us to get the substring from position 9 onwards, but only for a maximum of 6 characters.
SUBSTR('firstname.lastname@example.org', 9, 6)
This works best for strings with fixed formats.
If we have customer orders with a
123–XXXXXX–0000 format as follows:
We can use this substring format to get the 6 characters, from the 9th character onwards.
BigQuery STRPOS to Find Character in String
STRPOS function tells us which point in the string can we find a specific character (or text).
If there are multiple occurrences of the text, BigQuery will give the position (or index) of the first occurrence.
If we cannot find the text, the result is
Let’s have a look at some sample texts, where we try to find the position (or index) or the period sign.
|Hello, everyone.||16 (we always include spaces)|
|www.yuichiotsuka.com||4 (the first time a period is seen)|
|Hello World!||0 (there is no period)|
We can try the examples above using this query:
SELECT STRPOS('Mr. Yuichi', '.') AS Example_1, STRPOS('Hello, everyone.', '.') AS Example_2, STRPOS('www.yuichiotsuka.com', '.') AS Example_3, STRPOS('Hello World!', '.') AS Example_4 ;
Result: 3, 16, 4, and 0
STRPOS is not limited to single characters. We can also use words. The result we get is the position at the start of the word.
SELECT STRPOS('www.yuichiotsuka.com', 'yuichi') AS Example_5 ;
STRPOS is case sensitive. We will get a different result if the cases do not match. Let’s look at this example.
SELECT STRPOS('www.yuichiotsuka.com', 'YUICHI') AS Example_6 ;
To make a case-sensitive version, we add the
UPPER function twice. Alternatively, we can also use
LOWER, and the results are the same.
SELECT STRPOS(UPPER('www.yuichiotsuka.com'), UPPER('YUICHI')) AS Example_7, STRPOS(LOWER('www.yuichiotsuka.com'), LOWER('YUICHI')) AS Example_8 ;
Result: 5 and 5
BigQuery Substring LEFT and RIGHT
In other databases, the
RIGHT functions give us the first few characters from the left or right, respectively.
Here is a sample of getting the three (3) leftmost characters.
Now, here is a sample of getting the four (4) rightmost characters.
In BigQuery, however, the
RIGHT functions are not supported.
Do not worry! We can update our queries to work on BigQuery by making these little changes:
|Original Function||BigQuery Function||Explanation|
|LEFT (text, N)||SUBSTR (text, 1, N)||The 1 is always fixed, and we add the original length N as the third parameter.|
|RIGHT (text, M)||SUBSTR (text, –M, M)||The second parameter is the negative of M. The third parameter is M.|
Here are some examples of getting the left and right values from my name. Feel free to try it out with your name as well.
|Original Function||BigQuery Function||Result|
|LEFT (‘yuichi’, 3)||SUBSTR (‘yuichi’, 1, 3)||yui|
|RIGHT (‘yuichi’, 4)||SUBSTR (‘yuichi’, –4, 4)||ichi|
BigQuery Substring Not Found Error
If you run a
SUBSTRING function in BigQuery, you may an error that tells you something like this:
Function not found: SUBSTRING at […]
This simply means that we need to change from
SUBSTR, which is the accepted function in BigQuery.
BigQuery Substring After Character
In some cases, we do not know the exact positions of the substrings.
Sometimes, we just want the email domains after the at sign (@), right?
In these cases, we can use a combination of
SUBSTR to get the substring right after a character.
Here is the general formula:
SUBSTR(original, STRPOS(original, character)+1)
Let’s look at this example, where we consider a few other email addresses with different lengths. Note that some of these are sample email addresses only.
SELECT SUBSTR('email@example.com', STRPOS('firstname.lastname@example.org', '@')+1) AS Example_9, SUBSTR('email@example.com', STRPOS('firstname.lastname@example.org', '@')+1) AS Example_10, SUBSTR('email@example.com', STRPOS('firstname.lastname@example.org', '@')+1) AS Example_11 ;
Results: yuichiotsuka.com, mail.com, and gmail.com
Why does it work?
First, let’s look at the STRPOS function,
STRPOS(original, character)+1. This finds the correct starting point of the substring, with a +1 to exclude the character.
We then use that number as our starting point in the SUBSTR function.
REVERSE in BigQuery
REVERSE function reads the text from the end to the start.
SELECT REVERSE('ABCDE') AS Example_12 ;
BigQuery Add Leading Zeroes Using LPAD
If we have some strings of numbers, and we want to add leading zeroes, we can use the BigQuery
LPAD(text, maximum_length, '0')
The function automatically computes for the exact number of zeroes to add.
Let’s say we have a number,
123456, and we want to add leading zeroes to ensure we have exactly 9 characters.
SELECT LPAD('123456', 9, '0') AS Example_13 ;
The function adds 3 leading zeroes, just enough to make a 9-digit text.
Sometimes we want to split a text with separators into multiple parts, and then work on each part individually.
For example, if we have a product code
12345678–335329–0000, we may want to split it into three parts by the hyphen. We can do this using the BigQuery
Warning: The position in
SAFE_OFFSET is zero-based, which means that the first part corresponds to 0, second is to 1, and so on.
As an example, let’s try out this sample query:
SELECT SPLIT('12345678-335329-0000', '-')[SAFE_OFFSET(0)] AS Example_14, SPLIT('12345678-335329-0000', '-')[SAFE_OFFSET(1)] AS Example_15, SPLIT('12345678-335329-0000', '-')[SAFE_OFFSET(2)] AS Example_16, SPLIT('12345678-335329-0000', '-')[SAFE_OFFSET(3)] AS Example_17 ;
Result: 12345678, 335329, 0000, and null
Note that since there are only three parts, and since we use a zero-based index, the last example returned
There’s a ton of ways to process strings!
We used a visual and example-based approach to learn common BigQuery String functions.
To become better in SQL, we do not need to memorize each function.
What is more important is to know what we need and find the right solution for it.
For the full documentation of String Functions, you can visit the official Google BigQuery page.
Do you have a problem with your strings? Leave a comment down below!