bash - compare big xml files - get differences

The following script compares two big XML files. I tried to compare a lot of XML files larger than 500 megabytes with different tools. Each tool ate up my memory and swap and finally crashed. All I want is "show me what is in file one but not in file two, and vice versa". I reached this goal by using a property of my XML files: each file has nodes, and each node contains a unique identifier. I cut out the content of the unique identifier tag and write it, one identifier per line, into a file. After that, I sort these identifiers. Finally, I run diff. To create more useful output, I write "what is only in file one" into its own file (and do the same for file two).
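
To make the idea concrete, here is a minimal sketch on two tiny, made-up files. The tag name `id`, the file names and the sample identifiers are assumptions for illustration only; the steps (cut out, sort, diff, split) are the ones described above.

```shell
#!/bin/bash
# Hypothetical miniature inputs; <id> stands in for the unique identifier tag.
work_dir="$(mktemp -d)"
printf '%s\n' '<node><id>3</id></node>' '<node><id>1</id></node>' '<node><id>2</id></node>' > "$work_dir/one.xml"
printf '%s\n' '<node><id>2</id></node>' '<node><id>4</id></node>' > "$work_dir/two.xml"

# Cut out the identifiers (one per line) and sort them.
sed -n -e 's/.*<id>\(.*\)<\/id>.*/\1/p' "$work_dir/one.xml" | sort -u > "$work_dir/one.sort"
sed -n -e 's/.*<id>\(.*\)<\/id>.*/\1/p' "$work_dir/two.xml" | sort -u > "$work_dir/two.sort"

# diff prefixes lines only in the first file with "< " and lines only in the
# second file with "> ". It exits with status 1 when the inputs differ,
# which is expected here, hence the "|| true".
diff_output="$(diff "$work_dir/one.sort" "$work_dir/two.sort" || true)"

# Split the diff into "only in file one" and "only in file two".
only_in_one="$(printf '%s\n' "$diff_output" | sed -n -e 's/^< \(.*\)/\1/p')"
only_in_two="$(printf '%s\n' "$diff_output" | sed -n -e 's/^> \(.*\)/\1/p')"

echo "only in file one:"; printf '%s\n' "$only_in_one"
echo "only in file two:"; printf '%s\n' "$only_in_two"

rm -rf "$work_dir"
```

Because only the identifiers are kept, memory use depends on the number of nodes rather than on the size of the XML files. The sketch assumes the opening and closing tag of each identifier sit on the same line.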

Happy using, and if you find errors, I'm ready to fix them :-).

#!/bin/bash
####
# script to compare two xml files by (unique) tag
####
# @author stev leibelt
# @since 2013-03-13
####

if [[ $# -eq 3 ]]; then
  XML_FILE_ONE="$1"
  XML_FILE_TWO="$2"
  XML_TAG="$3"

  if [[ -f "$XML_FILE_ONE"
        && -f "$XML_FILE_TWO"
        && ! -z "$XML_TAG" ]]; then
    #retrieving xml_tags per file
    #reduce xmls by lines containing the tag
    sed -n -e "s/.*<$XML_TAG>\(.*\)<\/$XML_TAG>.*/\1/p" "$XML_FILE_ONE" > "$XML_FILE_ONE.sed"
    sed -n -e "s/.*<$XML_TAG>\(.*\)<\/$XML_TAG>.*/\1/p" "$XML_FILE_TWO" > "$XML_FILE_TWO.sed"

    #sort and uniq the sed'ed files
    sort -u "$XML_FILE_ONE.sed" > "$XML_FILE_ONE.sort"
    sort -u "$XML_FILE_TWO.sed" > "$XML_FILE_TWO.sort"

    #output the differences
    diff "$XML_FILE_ONE.sort" "$XML_FILE_TWO.sort" > 'xml_diff_by_tag.diff'
    #diff --side-by-side $XML_FILE_ONE'.sort' $XML_FILE_TWO'.sort' > 'xml_diff_by_tag.diff'
    #comm -3 $XML_FILE_ONE'.sort' $XML_FILE_TWO'.sort' > 'xml_diff_by_tag.comm'

    #show only differences per file
    sed -n -e 's/^< \(.*\)/\1/p' 'xml_diff_by_tag.diff' > "$XML_FILE_ONE.diff.uniq"
    sed -n -e 's/^> \(.*\)/\1/p' 'xml_diff_by_tag.diff' > "$XML_FILE_TWO.diff.uniq"

    #sed -n -e 's/^<\(.*\)/<\1/p' 'xml_diff_by_tag.comm' > $XML_FILE_ONE'.comm.uniq'
    #sed -n -e 's/\t<\(.*\)/<\1/p' 'xml_diff_by_tag.comm' > $XML_FILE_TWO'.comm.uniq'

    #removing unused files
    rm -f "$XML_FILE_ONE.sed" "$XML_FILE_TWO.sed" "$XML_FILE_ONE.sort" "$XML_FILE_TWO.sort"
  else
    echo 'Invalid arguments provided'
    echo 'try '$0' $xmlFileOne $xmlFileTwo $comparingTag'
  fi
else
  echo 'Invalid number of arguments provided'
  echo 'try '$0' $xmlFileOne $xmlFileTwo $comparingTag'
fi

Available on github.com.
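
The script contains a commented-out `comm` variant. As a rough sketch of how that route could look (not the script's actual behaviour), `comm -23` and `comm -13` return the per-file differences directly, without parsing diff output. The file names and identifiers below are made up for illustration.

```shell
#!/bin/bash
# Hypothetical, already extracted identifier lists (one identifier per line).
work_dir="$(mktemp -d)"
printf '%s\n' id-3 id-1 id-2 > "$work_dir/one.ids"
printf '%s\n' id-2 id-4 id-3 > "$work_dir/two.ids"

# comm expects both inputs sorted in the same collation order,
# so the locale is pinned to C for sort and comm alike.
LC_ALL=C sort -u "$work_dir/one.ids" > "$work_dir/one.sort"
LC_ALL=C sort -u "$work_dir/two.ids" > "$work_dir/two.sort"

# -23 suppresses columns 2 and 3: lines only in the first file remain.
only_in_one="$(LC_ALL=C comm -23 "$work_dir/one.sort" "$work_dir/two.sort")"
# -13 suppresses columns 1 and 3: lines only in the second file remain.
only_in_two="$(LC_ALL=C comm -13 "$work_dir/one.sort" "$work_dir/two.sort")"

echo "only in file one: $only_in_one"
echo "only in file two: $only_in_two"

rm -rf "$work_dir"
```

This avoids the intermediate diff file, at the price of requiring strictly sorted input.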

howto - sed - work with xml files - get content inside one tag

Assuming you have a large XML file (say 400 megabytes) and you want to extract the content inside one tag, which tool would solve this better than sed?

sed -n -e 's/.*<my_magicTag>\(.*\)<\/my_magicTag>.*/\1/p' myInputFile.xml > myInputFileFilteredByMyMagicTag.xml
So what are we doing? We tell sed to match any amount of text (possibly none) before "<my_magicTag>", then to remember any amount of text (possibly none) before "</my_magicTag>". With "\1", we use the first remembered pattern (since we only use one "\(...\)" group, there is only one in this command). With "p", we tell sed to print the result; combined with "-n", only the lines that matched are printed. After that, as usual, we use ">" to redirect the standard output into a file.
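
To see the command at work, here is a small sketch on a made-up sample file. The file content and the `other` tag are assumptions for illustration; only `my_magicTag` comes from the command above.

```shell
#!/bin/bash
# Hypothetical sample input with two tagged lines and one line without the tag.
sample_file="$(mktemp)"
cat > "$sample_file" <<'EOF'
<root>
  <node><my_magicTag>alpha</my_magicTag><other>1</other></node>
  <node><other>2</other></node>
  <node><my_magicTag>beta</my_magicTag></node>
</root>
EOF

# -n suppresses the automatic printing of every line; the p flag prints
# only those lines on which the substitution succeeded.
extracted="$(sed -n -e 's/.*<my_magicTag>\(.*\)<\/my_magicTag>.*/\1/p' "$sample_file")"
printf '%s\n' "$extracted"

rm -f "$sample_file"
```

Lines without the tag are dropped silently. Note that sed works line by line, so this only catches tags that open and close on the same line, and if a line contains the tag more than once, the leading greedy ".*" leaves only the last occurrence.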