Skip to content

bash - compare big xml files - get differences

The following script provides a solution to compare two big xml files. I tried to compare a lot of xml files with a size of greater 500 megabytes with different tools. Each tool was eating up my memory and swap and finally crashed. All i want to have is "show me what is in file one and not in file two and vice versa". I've reached this goal by using a property my xml files have. Each file as nodes. Each node has a unique identifier inside. I cutting out the unique identifier tag and putting this tag, line by line, into a file. After that, i'm sorting this unique identifiers. Finally i am using diff. To create a more useful output, i'm separating the "what is only in file one" into a own file (and the same for file two).

Happy using and if you find errors, i'm ready to fix them :-).

#!/bin/bash
####
# script to compare two xml files by (unique) tag
####
# @author stev leibelt
# @since 2013-03-13
####

if [[ $# -eq 3 ]]; then
  XML_FILE_ONE="$1"
  XML_FILE_TWO="$2"
  XML_TAG="$3"

  if [[ -f "$XML_FILE_ONE"
        && -f "$XML_FILE_TWO"
        && ! -z "$XML_TAG" ]]; then
    #retrieving xml_tags per file
    #reduce xmls by lines containing the tag
    sed -n -e 's/.*<'$XML_TAG'>\(.*\)<\/'$XML_TAG'>.*/\1/p' $XML_FILE_ONE > $XML_FILE_ONE'.sed'
    sed -n -e 's/.*<'$XML_TAG'>\(.*\)<\/'$XML_TAG'>.*/\1/p' $XML_FILE_TWO > $XML_FILE_TWO'.sed'

    #sort and uniq the sed'ed files
    sort $XML_FILE_ONE'.sed' | uniq > $XML_FILE_ONE'.sort'
    sort $XML_FILE_TWO'.sed' | uniq > $XML_FILE_TWO'.sort'

    #output the differences
    diff $XML_FILE_ONE'.sort' $XML_FILE_TWO'.sort' > 'xml_diff_by_tag.diff'
    #diff --side-by-side $XML_FILE_ONE'.sort' $XML_FILE_TWO'.sort' > 'xml_diff_by_tag.diff'
    #comm -3 $XML_FILE_ONE'.sort' $XML_FILE_TWO'.sort' > 'xml_diff_by_tag.comm'

    #show only differences per file
    sed -n -e 's/^<\ \(.*\)/\1/p' 'xml_diff_by_tag.diff' > $XML_FILE_ONE'.diff.uniq'
    sed -n -e 's/^>\ \(.*\)/\1/p' 'xml_diff_by_tag.diff' > $XML_FILE_TWO'.diff.uniq'

    #sed -n -e 's/^<\(.*\)/<\1/p' 'xml_diff_by_tag.comm' > $XML_FILE_ONE'.comm.uniq'
    #sed -n -e 's/\t<\(.*\)/<\1/p' 'xml_diff_by_tag.comm' > $XML_FILE_TWO'.comm.uniq'

    #removing unused files
    rm -fr $XML_FILE_ONE'.sed' $XML_FILE_TWO'.sed' $XML_FILE_ONE'.sort' $XML_FILE_TWO'.sort'
  else
    echo 'Invalid arguments provided'
    echo 'try '$0' $xmlFileOne $xmlFileTwo $comparingTag'
  fi
else
  echo 'Invalid number of arguments provided'
  echo 'try '$0' $xmlFileOne $xmlFileTwo $comparingTag'
fi

Available on github.com.

bash - enhanced svn diff

Again i want to share a simple bash function for dealing with the svn diff. I was tired to add the file i want to compare multiple times so i wrote a small wrapper function. The syntax is the following. net_bazzline_svn_diff https://svn.mydomain.org/trunk https://svn.mydomain.org/branches/my_branch module/myModule/myFile.foo Enjoy it and be aware that there is no fancy validation logic inside.

#####
# Calles svn diff for two repositories.
# Call $repositoryUrlOne $repositoryUrlTwo $filePath
#
# @author stev leibelt
# @since 2013-01-30
####
function net_bazzline_svn_diff ()
{
  if [ $# -eq 3 ]; then
    svn diff "$1""/""$3" "$2""/""$3"
  else
    echo 'No valid arguments supplied'
  fi
}
Sourceode also available on github.com.

tool - alternative to winmerge for linux - make a visual diff of two files

If you are searching for an alternative of winmerge that works quit fine on linux and other operation systems, take a look to meld. This tool provides you the ability to make a well visual diff from two files (preferred text files ;-)). It supports tabs, so you can diff multiple files. It is fast and it looks like the memory footprint is quite nice and small. By the way, merge also supports the main version control systems git, svn etc.