
If you’re working with a bunch of sources within your dbt project, maintaining a single source file (or even a couple) can quickly become cumbersome, especially if you have multiple people editing and committing changes to the same file.
It is much better to have a single file per source table; this makes it easier for distributed teams to navigate the project and track changes. In pursuit of that, I’ve created a separate folder in the root dbt project directory called sources, under which there are subfolders representing database and schema. So if I have a source table, raw.data_lake.table1, in my database, my dbt project would have a raw folder containing a data_lake folder, underneath which table1.yml would live:
```
./sources
./sources/raw
./sources/raw/data_lake
./sources/raw/data_lake/table1.yml
```
I picked up this concept from a blog post by Shopify Engineering, although the layout above is a slightly more verbose version of what they describe in that post.
I could, of course, do everything by hand and write each yml separately, but my attention span is (at a push) five files. After that, my brain starts wondering what’s for dinner and whether the landscaping work in my garden is ever going to end (or begin), and mistakes creep in.
So, here’s a semi-manual solution to this very, very niche problem. It’s only semi-manual because you still need to copy-paste and press a few return keys along the way.
Prerequisites
This blog assumes some familiarity with dbt. You’re welcome to run with this blog regardless, of course, but I’d also suggest checking out dbt’s excellent online courses page, where you’ll find a much more methodical introduction to its concepts.
I’m also going to lean heavily on UNIX utilities to help me bulk-edit files, so you may want to read up on their usage separately, but there is a glossary for each switch used in this blog that should hopefully explain its usage well enough. The code below was tested with zsh on macOS, but I used POSIX-compliant syntax where possible, so you should find it works verbatim on WSL, Ubuntu or any other UNIX-like system that complies with POSIX. YMMV if you use an alternative shell or OS.
- sed: stands for stream editor. This utility takes either stdin or a file reference and processes the contents as requested by the user. Think of it as find-and-replace on steroids.
- find: like ls on steroids (bit of a pattern here) when it comes to recursively searching the current directory for things. It takes up slightly more system resources, but its useful switches (notably -exec) make it worthwhile. We could have used ls instead, but there’s a well-known argument against parsing ls output.
- csplit: splits files into sections determined by a pattern, then writes those sections individually into new, separate files.
To make use of the dbt macro we’ll run in the next section, you’ll need to add a new file in your dbt project directory. Call it packages.yml, the contents of which should read something like this:
```yml
packages:
  - package: dbt-labs/codegen
    version: 0.5.0
```
Version 0.5.0 was the latest at the time of writing, but make sure you reference the latest stable version available. After that, run dbt deps to pull and install dependencies (codegen in this case), and you’re good to go for the next section.
Use dbt to Generate sources.yml
Using a pre-built macro from dbt-labs, generate the yml file for the source schema. Within dbt Cloud, run the following:
```shell
$ dbt run-operation generate_source --args '{"schema_name": "data_lake", "database_name": "raw", "generate_columns": true, "include_descriptions": true}'
```
This will generate output in the logs (click on view logs). Note the extraneous timestamp printed in the log as well.
```yml
12:30:14 version: 2

sources:
  - name: schema_name
    database: database_name
    tables:
      - name: table_name1
        columns:
          - name: col1
            description: ""
          - name: col2
            description: ""
          - name: <...>
      - name: table_name2
        columns:
          - name: <...>
```
Paste the output starting from line 7 into a separate file called, for example, sources.yml. The file should look something like this (indentation is important):
```yml
      - name: table_name1
        columns:
          - name: col1
            description: ""
          - name: col2
            description: ""
          - name: <...>
      - name: table_name2
        columns:
          - name: <...>
```
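If you saved the log output to a file, you can automate that paste step, too. Here’s a minimal sketch, assuming the log text above lives in a hypothetical file called dbt_output.log:

```bash
# print everything from line 7 onward (the tables and their columns)
# and write it out as the new sources.yml
$ tail -n +7 dbt_output.log > sources.yml
```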
Use SQL to Generate sources.yml
Of course, you could get the exact same information from information_schema and use SQL text functions to print out the valid yml. Let’s do that:
```sql
select
    concat(
          '\n  - name: ', lower(table_schema)
        , '\n    database: ', lower(table_catalog)
        , '\n    description: ""'
        , '\n    tables: '
        , '\n      - name: ', lower(table_name)
        , '\n        columns: '
        , '\n'
    )
    || listagg(
        concat(
              '          - name: ', column_name, '\n'
            , '            description: ""', '\n'
            , '            meta: ', '\n'
            , '              pii_type: '
        )
        , '\n'
    )
    || '\n' as ""
from raw.information_schema.columns
where concat_ws('.', table_catalog, table_schema) = 'DATABASE.SCHEMA'
group by
      table_catalog
    , table_schema
    , table_name
```
Each row in the output will be identical to the output from dbt above.
The concat() and concat_ws() functions concatenate disparate strings (in the latter function’s case, we’re also using a dot separator), while listagg() collapses strings across different rows into a single row.
For example, listagg() will take raw output from information_schema.columns (abridged):
| ROW | table_name | column_name |
| --- | ---------- | ----------- |
| 1   | TABLE1     | COLUMN1     |
| 2   | TABLE1     | COLUMN2     |
| 3   | TABLE2     | COLUMN3     |
| 4   | TABLE2     | COLUMN4     |
| 5   | TABLE2     | COLUMN5    |
| 6   | TABLE3     | COLUMN6     |
| 7   | TABLE3     | COLUMN7     |
And if you run the following,
```sql
select listagg(column_name, ', ')
from information_schema.columns
group by table_name
```
it would return:
| ROW | listagg(column_name, ', ') |
| --- | -------------------------- |
| 1   | COLUMN1, COLUMN2           |
| 2   | COLUMN3, COLUMN4, COLUMN5  |
| 3   | COLUMN6, COLUMN7           |
The rest of the SQL statement is just dressing up the above to make it a valid yaml file. See the docs pages on concat(), concat_ws() and listagg() for more info.
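If you’d rather not run the query interactively, you could script this step as well. Below is a sketch using Snowflake’s snowsql CLI; it assumes you have a default connection configured, that the query above is saved in a hypothetical query.sql, and that the -o output options behave as documented on your version:

```bash
# run the yml-generating query from a file, suppress headers,
# timing and banner output, and write the plain result to sources.yml
$ snowsql -f query.sql \
    -o output_format=plain \
    -o header=false \
    -o timing=false \
    -o friendly=false > sources.yml
```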
Optional: Add Custom yaml Keys
If you used the dbt macro to create the yml file, you might want to add a few extra keys. If you need placeholder space for metadata on each of the columns, now’s the time to add it. I’ll add tests and meta (with a nested pii_type) keys using the script below:
```bash
sed -i 's/description: \"\"/description: "" \
    meta: \
      pii_type: \
    tests:/' sources.yml

# where
# -i:
#     modify in place (overwrite original file)
# 's/pattern/replacement/options':
#     "s" stands for substitute. This runs a regex
#     find-replace on pattern-replacement
#     (options are... optional)
# \ (at the end of a line):
#     line continuation indicating that the
#     command continues on the next line
```
This will look for the exact string description: "" and replace it with the description plus the new keys. I’m using description: "" here as that is a string repeated for each of the columns within my tables. The output of the code above should look something like this:
```
      - name: table_name1
        columns:
          - name: col1
            description: ""
            meta:
              pii_type:
            tests:
          - name: col2
            description: ""
            meta:
              pii_type:
            tests:
          - name: <...>
      - name: table_name2
        columns:
          - name: <...>
```
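One caveat: the stock BSD sed that ships with macOS requires an explicit backup extension after -i (GNU sed doesn’t), so on a Mac the in-place edits in this post need an extra empty-string argument:

```bash
# BSD/macOS sed: -i takes a mandatory backup-suffix argument;
# an empty string means "edit in place, keep no backup copy"
$ sed -i '' 's/old/new/' sources.yml
```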
Split Single yaml File into Set of Separate Files
Given the input file sources.yml containing the output from the previous section, use csplit to write separate files with the moniker _src\d{2}.yml, treating each empty line as a file separator:
```bash
$ csplit --quiet --prefix=_src --suffix-format=%02d.yml \
    --suppress-matched sources.yml /^$/ {*}

# where
# --quiet:
#     do not print counts of output file sizes
# --prefix:
#     use PREFIX instead of 'xx'
# --suffix-format:
#     use sprintf FORMAT instead of %02d
# --suppress-matched:
#     suppress the lines matching PATTERN
# /^$/:
#     regex pattern to split the files by
# {*}:
#     repeat the previous pattern as many times as possible
```
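Note that the long options above belong to GNU csplit; the BSD csplit bundled with macOS doesn’t support them (notably --suppress-matched). If you’re on a Mac, one workaround is to install GNU coreutils and use the g-prefixed binary instead:

```bash
# Homebrew's coreutils package ships the GNU tools prefixed with "g"
$ brew install coreutils

# same invocation as above, via the GNU build
$ gcsplit --quiet --prefix=_src --suffix-format=%02d.yml \
    --suppress-matched sources.yml /^$/ {*}
```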
See the rest of this excellent man page if you want to learn more. The code above was shamelessly stolen verbatim from this StackOverflow answer. There shouldn’t be any output on stdout, but you will now notice a bunch of new files in the current directory: _src00.yml, _src01.yml and so on.
Insert Missing yaml Keys at Each File’s Start
Given a set of files in the current directory matching the pattern _src.*.yml, insert a multi-line string at the beginning of each file:
```bash
$ sed -i '1i version: 2 \
sources: \
  - name: schema_name \
    database: database_name \
    tables:' _src*
```
The code was adapted from this StackOverflow answer. Make sure to replace schema_name and database_name with the appropriate schema and database names. Each file should now be properly formatted yaml:
```yml
version: 2
sources:
  - name: schema_name
    database: database_name
    tables:
      - name: table_name1
        columns:
          - name: col1
            description: ""
          - name: col2
            description: ""
          - name: <...>
```
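Before renaming anything, it’s worth sanity-checking that the table name really does sit on line 6 of each file, since the next step depends on it. A quick spot check (the file name is just an example):

```bash
# print line 6 of one of the generated files;
# it should read something like '      - name: table_name1'
$ sed -n '6p' _src00.yml
```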
Rename Each yaml File to Match Targeted Table Name
Given a set of files in the current directory matching the pattern _src.*.yml, find the sixth line within each file. Extract and capture the text matching the pattern `.*- name: (.*)` from the contents of that line. Construct a rename statement using the captured contents (removing unnecessary characters and lower-casing the string, just in case):
```bash
$ find . -maxdepth 1 -regex '.*_src.*yml' -exec \
    sed -n "6 s|.*- name: \(.*\)|mv '{}' \L\1.yml|p" '{}' \;

# where
# sed -n:
#     commonly used with a `p` option;
#     tells sed to only print the lines you ask it to print
#     (the default is to print everything).
#     See https://superuser.com/a/852491 for more info.
# find -maxdepth N:
#     defines the maximum subdirectory depth that `find` descends to,
#     so a value of `2` would mean find does not go further down
#     than 1 subfolder.
# find -regex '.*':
#     filters files based on a regex pattern. A pattern of '.*\.csv'
#     would only grab files with an extension of `.csv`
# find -exec <command> {} \;:
#     executes another command for each of the files found.
#     Each file found using the `find` utility is represented by `{}`.
```
This should print (to stdout) commands like the below:
```bash
mv './_src00.yml' 'name-from-_src00.yml-line-6.yml'
mv './_src01.yml' 'name-from-_src01.yml-line-6.yml'
...
mv './_srcNN.yml' 'name-from-_srcNN.yml-line-6.yml'
```
Finally, if you add | sh at the end of the code above, it will execute the rename commands (`mv`) in order, so the full script should look like this:
```bash
$ find . -maxdepth 1 -regex '.*_src.*yml' -exec \
    sed -n "6 s|.*- name: \(.*\)|mv '{}' \L\1.yml|p" '{}' \; | sh
```
This should then produce a set of new yaml files with their proper names. You can now put those files in the appropriate directory listed in the source-paths config option within your dbt_project.yml, and dbt will pick them up on the next run. This step was adapted from this StackOverflow answer.
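To finish up, the renamed files need to move into the folder structure from the start of this post. A minimal sketch, assuming you ran all of the above in a scratch directory and are using the example database and schema names from earlier:

```bash
# create the nested sources folders and move the per-table files in;
# note this glob also moves the working sources.yml, so delete it
# first if you don't want it tagging along
$ mkdir -p sources/raw/data_lake
$ mv ./*.yml sources/raw/data_lake/
```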
Conclusion
And there you have it: some snippets of code you can use to manipulate the files generated through dbt or SQL. The beauty of having these scripts is that, as soon as a new source is added, we can generate a new set of yml files pretty painlessly.
This is what our example folder structure for sources looks like at the end of all these steps (using the example table names from earlier):
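```
./sources
./sources/raw
./sources/raw/data_lake
./sources/raw/data_lake/table_name1.yml
./sources/raw/data_lake/table_name2.yml
```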