bash - compare big xml files - get differences

The following script compares two big XML files. I tried several tools on XML files larger than 500 megabytes; each one ate up my memory and swap and finally crashed. All I want is "show me what is in file one but not in file two, and vice versa". I reached this goal by using a property of my XML files: each file has nodes, and each node contains a unique identifier. I cut out the unique identifier tag and write its content, line by line, into a file. After that, I sort these unique identifiers. Finally, I use diff. To create more useful output, I put the "what is only in file one" lines into a separate file (and do the same for file two).
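To make the extraction step concrete, here is a minimal sketch on a tiny sample file. The function name `extract_ids`, the tag name `id`, and the sample content are assumptions for illustration only; they are not part of the script below. The sketch also assumes that the opening and closing tag sit on the same line, which is what the sed expression relies on.

```shell
#!/bin/bash
# Hypothetical helper: print the sorted, de-duplicated content of <tag>...</tag>.
# Assumes each opening/closing tag pair sits on a single line.
extract_ids() {
  local tag="$1" file="$2"
  sed -n -e "s/.*<${tag}>\(.*\)<\/${tag}>.*/\1/p" "$file" | sort -u
}

# Tiny sample document standing in for a large export (made-up data).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
<nodes>
  <node><id>b-2</id><name>second</name></node>
  <node><id>a-1</id><name>first</name></node>
  <node><id>b-2</id><name>duplicate</name></node>
</nodes>
EOF

# Lines without the tag are dropped, duplicates collapse into one entry.
extract_ids id "$sample"
rm -f "$sample"
```

For the sample above this prints `a-1` and `b-2`, one per line. Only the identifiers are kept, so the later diff tells you which nodes exist in one file but not the other; it does not compare the content of the nodes.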

Happy using, and if you find errors, I'm ready to fix them :-).

#!/bin/bash
####
# script to compare two xml files by (unique) tag
####
# @author stev leibelt
# @since 2013-03-13
####

if [[ $# -eq 3 ]]; then
  XML_FILE_ONE="$1"
  XML_FILE_TWO="$2"
  XML_TAG="$3"

  if [[ -f "$XML_FILE_ONE"
        && -f "$XML_FILE_TWO"
        && -n "$XML_TAG" ]]; then
    #retrieving xml_tags per file
    #reduce xmls by lines containing the tag
    #note: opening and closing tag have to be on the same line
    #variables are quoted so that paths with spaces work
    sed -n -e 's/.*<'"$XML_TAG"'>\(.*\)<\/'"$XML_TAG"'>.*/\1/p' "$XML_FILE_ONE" > "$XML_FILE_ONE.sed"
    sed -n -e 's/.*<'"$XML_TAG"'>\(.*\)<\/'"$XML_TAG"'>.*/\1/p' "$XML_FILE_TWO" > "$XML_FILE_TWO.sed"

    #sort and uniq the sed'ed files (sort -u does both in one step)
    sort -u "$XML_FILE_ONE.sed" > "$XML_FILE_ONE.sort"
    sort -u "$XML_FILE_TWO.sed" > "$XML_FILE_TWO.sort"

    #output the differences
    diff "$XML_FILE_ONE.sort" "$XML_FILE_TWO.sort" > 'xml_diff_by_tag.diff'
    #diff --side-by-side $XML_FILE_ONE'.sort' $XML_FILE_TWO'.sort' > 'xml_diff_by_tag.diff'
    #comm -3 $XML_FILE_ONE'.sort' $XML_FILE_TWO'.sort' > 'xml_diff_by_tag.comm'

    #show only differences per file
    #"< " marks lines only in file one, "> " marks lines only in file two
    sed -n -e 's/^< \(.*\)/\1/p' 'xml_diff_by_tag.diff' > "$XML_FILE_ONE.diff.uniq"
    sed -n -e 's/^> \(.*\)/\1/p' 'xml_diff_by_tag.diff' > "$XML_FILE_TWO.diff.uniq"

    #sed -n -e 's/^<\(.*\)/<\1/p' 'xml_diff_by_tag.comm' > $XML_FILE_ONE'.comm.uniq'
    #sed -n -e 's/\t<\(.*\)/<\1/p' 'xml_diff_by_tag.comm' > $XML_FILE_TWO'.comm.uniq'

    #removing temporary files (plain files, so no -r needed)
    rm -f "$XML_FILE_ONE.sed" "$XML_FILE_TWO.sed" "$XML_FILE_ONE.sort" "$XML_FILE_TWO.sort"
  else
    #report errors on stderr and signal failure to the caller
    echo 'Invalid arguments provided' >&2
    echo "try $0 \$xmlFileOne \$xmlFileTwo \$comparingTag" >&2
    exit 1
  fi
else
  echo 'Invalid number of arguments provided' >&2
  echo "try $0 \$xmlFileOne \$xmlFileTwo \$comparingTag" >&2
  exit 1
fi

Available on github.com.
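
The commented-out `comm` lines in the script hint at an alternative to parsing the diff output. A minimal sketch of that idea, assuming the two identifier lists are already extracted and sorted: `comm -23` prints the lines found only in the first file and `comm -13` the lines found only in the second. The function name `only_in_each` and the sample identifiers are made up for illustration. Because `comm` expects both inputs sorted in the same collation order, the sketch pins `LC_ALL=C` for both `sort` and `comm`.

```shell
#!/bin/bash
# Sketch: split two sorted identifier lists into "only in one" / "only in two".
# Hypothetical helper, not part of the original script.
only_in_each() {
  local sorted_one="$1" sorted_two="$2" out_one="$3" out_two="$4"
  LC_ALL=C comm -23 "$sorted_one" "$sorted_two" > "$out_one"  # unique to list one
  LC_ALL=C comm -13 "$sorted_one" "$sorted_two" > "$out_two"  # unique to list two
}

# Made-up identifier lists standing in for the '.sort' files.
work="$(mktemp -d)"
printf '%s\n' a-1 b-2 c-3 | LC_ALL=C sort -u > "$work/one.sort"
printf '%s\n' b-2 c-3 d-4 | LC_ALL=C sort -u > "$work/two.sort"

only_in_each "$work/one.sort" "$work/two.sort" "$work/one.uniq" "$work/two.uniq"
cat "$work/one.uniq"   # identifiers missing from the second list
cat "$work/two.uniq"   # identifiers missing from the first list
rm -rf "$work"
```

With the sample lists this prints `a-1` followed by `d-4`. Compared to the diff variant, there are no `<` and `>` markers to strip afterwards, at the cost of reading both sorted lists twice.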
