
Counting words in big files that change once per day while the check runs multiple times per hour?

Take the following scenario as given.

A monitoring tool parses a big file every x minutes to count the occurrences of a word in it. If the count is below a threshold, countermeasures are triggered. This big file is created once per day.

This works fine as long as the file is small. Now think about XML files that are gigabytes in size, stored on network storage, and you have many of them to monitor.

My solution for this problem is to create a caching layer. The only thing we need to solve is to detect if the big file has changed.

Using sha256sum on big files takes time. Using md5sum on big files takes time. There is one thing that is fast and good enough as a change indicator: the size and modification time of the file, as printed by ls -l.

ls -l jobs_with_channels_multiple_location_nodes.xml | awk '{print $5$6$7$8}' 
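
To see why this is good enough as a change detector, here is a minimal sketch. The function name build_cache_key and the temporary demo file are made up for illustration and assume the usual ls -l layout, where fields 5 to 8 are size, month, day and time. Keep in mind that the time is usually printed with minute resolution only, so a rewrite within the same minute that keeps the size identical would go unnoticed.

```shell
#!/bin/bash
# Hypothetical helper: builds the cache key from the size and the
# modification time as printed by ls -l.
build_cache_key() {
    ls -l "$1" | awk '{print $5$6$7$8}'
}

# Demo with a throwaway file (the path comes from mktemp, not from the post).
DEMO_FILE=$(mktemp)
echo '<foo>' > "${DEMO_FILE}"
KEY_BEFORE=$(build_cache_key "${DEMO_FILE}")

# Appending changes the size, so the key changes as well.
echo '<foo>' >> "${DEMO_FILE}"
KEY_AFTER=$(build_cache_key "${DEMO_FILE}")

if [ "${KEY_BEFORE}" != "${KEY_AFTER}" ];
then
    echo "file changed"
fi

rm -f "${DEMO_FILE}"
```

No file content is read for the key, only the metadata, which is why this stays fast even for gigabyte files on network storage.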

Where to store the cache? Just use the name of the big file and add a ".cache" to the name.

What should be in the cache file? Only two lines. First line is the cache key, second line is the cached count value.
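
The two-line layout can be sketched like this. The helper name write_cache, the demo key and the demo count are assumptions for illustration only; reading works with plain head and tail.

```shell
#!/bin/bash
# Hypothetical helper: writes the two-line cache file.
# $1 = cache file path, $2 = cache key, $3 = counted value
write_cache() {
    printf '%s\n%s\n' "$2" "$3" > "$1"
}

# Demo with a throwaway cache file and made-up values.
DEMO_CACHE=$(mktemp)
write_cache "${DEMO_CACHE}" "1048576Mar612:00" "42"

# First line is the key, last line is the count.
CACHED_KEY=$(head -n1 "${DEMO_CACHE}")
CACHED_COUNT=$(tail -n1 "${DEMO_CACHE}")

echo "key=${CACHED_KEY} count=${CACHED_COUNT}"

rm -f "${DEMO_CACHE}"
```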

And the logic?

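#!/bin/bash
####
# Counts the number of <foo> nodes in the provided file.
# grep -c counts matching lines, so this assumes at most one <foo> per line.
# To speed things up, we create a cache file.
# This is just a logic example. If you are using it in production, good luck!
####
# @since 2019-03-06
# @author stev leibelt <>
####

# Assumption: the file path is passed as first argument.
SOURCE_FILE_PATH="${1:-}"
CACHE_FILE_PATH="${SOURCE_FILE_PATH}.cache"
CACHE_KEY=""

if [[ ! -f "${SOURCE_FILE_PATH}" ]];
then
    echo ":: Invalid argument provided."
    echo "   Provided file path >>${SOURCE_FILE_PATH}<< does not exist."

    exit 1;
fi

SOURCE_CACHE_KEY=$(ls -l "${SOURCE_FILE_PATH}" | awk '{print $5$6$7$8}')

# Read the stored key if a cache file exists already.
if [[ -f "${CACHE_FILE_PATH}" ]];
then
    CACHE_KEY=$(head -n1 "${CACHE_FILE_PATH}")
fi

if [[ "${CACHE_KEY}" == "${SOURCE_CACHE_KEY}" ]];
then
    # Cache hit: file is unchanged, use the cached value.
    COUNT=$(tail -n1 "${CACHE_FILE_PATH}")
else
    # Cache miss: count again and rewrite the cache file.
    COUNT=$(grep -c '<foo>' "${SOURCE_FILE_PATH}")
    cat >"${CACHE_FILE_PATH}"<<DELIM
${SOURCE_CACHE_KEY}
${COUNT}
DELIM
fi

echo "${COUNT}"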
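
The monitoring side from the story at the top could then look roughly like this. This is only a sketch: the function name is_below_threshold, the count and the threshold value are assumptions, and in a real check COUNT would be the output of the script above.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit code 0) if the count is below the threshold.
# $1 = counted value, $2 = threshold
is_below_threshold() {
    [ "$1" -lt "$2" ]
}

COUNT=120        # made-up value, normally the output of the counting script
THRESHOLD=500    # made-up limit

if is_below_threshold "${COUNT}" "${THRESHOLD}";
then
    RESULT="below threshold"
else
    RESULT="ok"
fi

echo "${RESULT}"
```

Since the counting script only prints the number, the threshold logic stays in the monitoring tool and the cache stays invisible to it.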

Hope this helps.

