Skip to content

Counting numbers on big files that changes once per day but the check is done multiple times per hour?

Taking the following as a story that exists.

A monitoring tool is parsing a big file every x minutes to count the amount of a word inside. If the counted amount is less than a threshold limit, counter measurements are triggered. This big file is created once per day.

This is working fine if the file is small. Now think about xml files with a size of gigabytes and this files are on a network storage and you have many of them to monitor.

My solution for this problem is to create a caching layer. The only thing we need to solve is to detect if the big file has changed.

Using sha256sum on big files takes time. Using md5sum on big files takes time. There is one thing that is fast and good enough to be unique.

ls -l jobs_with_channels_multiple_location_nodes.xml | awk '{print $5$6$7$8}' 

Where to store the cache? Just use the name of the big file and add a ".cache" to the name.

What should be in the cache file? Only two lines. First line is the cache key, second line is the cached count value.

And the logic?

#!/bin/bash
####
# Counts amount of <foo> nodes in provided file.
# To speed up things, we create a cache file.
#
# This is just an logic example file. If you are using it on production, good luck!
#####
# @since 2019-03-06
# @author stev leibelt <artodeto@bazzline.net>
####

SOURCE_FILE_PATH="${1}";
CACHE_FILE_PATH="${SOURCE_FILE_PATH}.count_cache"

if [[ ! -f "${SOURCE_FILE_PATH}" ]];
then
    echo ":: Invalid argument provided."
    echo "   Provided file path >>${SOURCE_FILE_PATH}<< does not exist."

    exit 1;
fi

SOURCE_CACHE_KEY=$(ls -l "${SOURCE_FILE_PATH}" | awk '{print $5$6$7$8}' )

if [[ -f "${CACHE_FILE_PATH}" ]];
then
    CACHE_KEY=$(head -n1 "${CACHE_FILE_PATH}")
else
    CACHE_KEY=""
fi

if [[ "${CACHE_KEY}" == "${SOURCE_CACHE_KEY}" ]];
then
    COUNT=$(tail -n1 "${CACHE_FILE_PATH}")
else
    COUNT=$(cat "${SOURCE_FILE_PATH}" | grep -c '<foo>')
    cat >"${CACHE_FILE_PATH}"<<DELIM
${SOURCE_CACHE_KEY}
${COUNT}
DELIM
fi

echo ${COUNT}

Hope this helps.

Trackbacks

Keine Trackbacks

Kommentare

Ansicht der Kommentare: Linear | Verschachtelt

Noch keine Kommentare

Die Kommentarfunktion wurde vom Besitzer dieses Blogs in diesem Eintrag deaktiviert.

Kommentar schreiben

Die angegebene E-Mail-Adresse wird nicht dargestellt, sondern nur für eventuelle Benachrichtigungen verwendet.
Um einen Kommentar hinterlassen zu können, erhalten Sie nach dem Kommentieren eine E-Mail mit Aktivierungslink an ihre angegebene Adresse.

Um maschinelle und automatische Übertragung von Spamkommentaren zu verhindern, bitte die Zeichenfolge im dargestellten Bild in der Eingabemaske eintragen. Nur wenn die Zeichenfolge richtig eingegeben wurde, kann der Kommentar angenommen werden. Bitte beachten Sie, dass Ihr Browser Cookies unterstützen muss, um dieses Verfahren anzuwenden.
CAPTCHA

Markdown-Formatierung erlaubt
Formular-Optionen