
Counting numbers in big files that change once per day while the check is done multiple times per hour?

Take the following as a story that really happened.

A monitoring tool parses a big file every x minutes to count the occurrences of a word inside it. If the counted number is below a threshold, countermeasures are triggered. The big file is created once per day.

This works fine as long as the file is small. Now think about XML files with a size of several gigabytes, stored on network storage, and you have many of them to monitor.
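
To make the problem concrete, a naive check could look like the following minimal sketch; the node name <foo>, the threshold and the reaction are placeholders. On a file of several gigabytes on network storage, the grep runs over the whole file on every single check:

#!/bin/bash
SOURCE_FILE_PATH="${1}"
THRESHOLD=100

COUNT=$(grep -c '<foo>' "${SOURCE_FILE_PATH}")

if [[ ${COUNT} -lt ${THRESHOLD} ]];
then
    echo ":: Only ${COUNT} <foo> nodes counted, expected at least ${THRESHOLD}."
    # trigger your countermeasures here
fi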

My solution to this problem is to create a caching layer. The only thing we have to solve is detecting whether the big file has changed.

Using sha256sum on big files takes time. Using md5sum on big files takes time. But there is one thing that is fast and unique enough for our purpose: the file size plus the modification date and time, as reported by ls.

ls -l jobs_with_channels_multiple_location_nodes.xml | awk '{print $5$6$7$8}' 

Where to store the cache? Just take the path of the big file and append a suffix like ".count_cache".

What should be in the cache file? Only two lines: the first line is the cache key, the second line is the cached count value.

And the logic?

#!/bin/bash
####
# Counts the number of <foo> nodes in the provided file.
# To speed up things, we create a cache file.
#
# This is just an example showing the logic. If you use it in production, good luck!
#####
# @since 2019-03-06
# @author stev leibelt <artodeto@bazzline.net>
####

SOURCE_FILE_PATH="${1}";
CACHE_FILE_PATH="${SOURCE_FILE_PATH}.count_cache"

if [[ ! -f "${SOURCE_FILE_PATH}" ]];
then
    echo ":: Invalid argument provided."
    echo "   Provided file path >>${SOURCE_FILE_PATH}<< does not exist."

    exit 1;
fi

# The cache key is the file size plus the modification date and time, taken from "ls -l"
SOURCE_CACHE_KEY=$(ls -l "${SOURCE_FILE_PATH}" | awk '{print $5$6$7$8}')

if [[ -f "${CACHE_FILE_PATH}" ]];
then
    CACHE_KEY=$(head -n1 "${CACHE_FILE_PATH}")
else
    CACHE_KEY=""
fi

if [[ "${CACHE_KEY}" == "${SOURCE_CACHE_KEY}" ]];
then
    COUNT=$(tail -n1 "${CACHE_FILE_PATH}")
else
    # grep -c counts matching lines, which is good enough as long as there is at most one <foo> node per line
    COUNT=$(grep -c '<foo>' "${SOURCE_FILE_PATH}")
    cat >"${CACHE_FILE_PATH}"<<DELIM
${SOURCE_CACHE_KEY}
${COUNT}
DELIM
fi

echo "${COUNT}"
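
On the monitoring side you can now call the script instead of grepping the big file directly. A minimal sketch, assuming the script is stored as count_foo_nodes.sh and using a placeholder threshold of 100:

COUNT=$(bash count_foo_nodes.sh "jobs_with_channels_multiple_location_nodes.xml")

if [[ ${COUNT} -lt 100 ]];
then
    echo ":: Only ${COUNT} <foo> nodes counted, triggering countermeasures."
fi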

Hope this helps.

howto - speed up mysql with a constantly increasing table size

The starting point is the following: you have a table with constantly growing content. Because of that, somebody added a timestamp or created_at column. Now a requirement pops up saying "we need a quick overview of the entries per day and we want to paginate over the results".

The first idea is to add an index on the created_at column. But your table keeps growing, and another index means more work for the database on every write. To implement the pagination, you use a LIMIT with an offset. But the limit does not really speed up your pagination when you try to fetch the last pages. The problem with the limit is that MySQL still has to walk over all the rows you want to skip (because of the offset you have to provide).

So what to do? Assuming you have a unique key that is an auto increment value, you can use it to reach your goals (and you do not need the extra created_at column or the write intensive index on it). For the per day pagination you still need a little helper, though. One idea is to create a table that holds the first unique id per day; another is to create a table per day or month. Since the second idea has too many drawbacks for me, I will explain the first one in a bit more detail.

The idea is to create a helper table with the columns `id`, `date` and `first_id_for_the_day`. With it you can easily answer the question "which ids were inserted on which day". Now you just have to add a "BETWEEN id_of_search_date AND (id_of_search_next_date - 1)" to your query. Of course you still need the limit, but the database now works on a much smaller scope (bounded by the BETWEEN clause), so it should not be a big deal anymore :-).
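
A minimal sketch of both pieces, run through the mysql command line client. The table name entries, the helper table name first_id_per_day, the credentials and the concrete dates are assumptions, not taken from a real schema:

mysql --user="your_user" --password="your_password" "your_database" <<DELIM
-- helper table: one row per day, pointing to the first auto increment id of that day
CREATE TABLE IF NOT EXISTS first_id_per_day (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    date DATE NOT NULL,
    first_id_for_the_day BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY (date)
);

-- paginate over the entries of 2019-03-06, second page with 100 rows per page
SELECT entries.*
FROM entries
WHERE entries.id BETWEEN
        (SELECT first_id_for_the_day FROM first_id_per_day WHERE date = '2019-03-06')
    AND
        (SELECT first_id_for_the_day FROM first_id_per_day WHERE date = '2019-03-07') - 1
LIMIT 100, 100;
DELIM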