Skip to content

bash - compare big xml files - get differences

The following script provides a solution to compare two big xml files. I tried to compare a lot of xml files with a size of greater 500 megabytes with different tools. Each tool was eating up my memory and swap and finally crashed. All i want to have is "show me what is in file one and not in file two and vice versa". I've reached this goal by using a property my xml files have. Each file as nodes. Each node has a unique identifier inside. I cutting out the unique identifier tag and putting this tag, line by line, into a file. After that, i'm sorting this unique identifiers. Finally i am using diff. To create a more useful output, i'm separating the "what is only in file one" into a own file (and the same for file two).

Happy using and if you find errors, i'm ready to fix them :-).

#!/bin/bash
####
# script to compare two xml files by (unique) tag
####
# @author stev leibelt
# @since 2013-03-13
####

if [[ $# -eq 3 ]]; then
  XML_FILE_ONE="$1"
  XML_FILE_TWO="$2"
  XML_TAG="$3"

  if [[ -f "$XML_FILE_ONE"
        && -f "$XML_FILE_TWO"
        && ! -z "$XML_TAG" ]]; then
    #retrieving xml_tags per file
    #reduce xmls by lines containing the tag
    sed -n -e 's/.*<'$XML_TAG'>\(.*\)<\/'$XML_TAG'>.*/\1/p' $XML_FILE_ONE > $XML_FILE_ONE'.sed'
    sed -n -e 's/.*<'$XML_TAG'>\(.*\)<\/'$XML_TAG'>.*/\1/p' $XML_FILE_TWO > $XML_FILE_TWO'.sed'

    #sort and uniq the sed'ed files
    sort $XML_FILE_ONE'.sed' | uniq > $XML_FILE_ONE'.sort'
    sort $XML_FILE_TWO'.sed' | uniq > $XML_FILE_TWO'.sort'

    #output the differences
    diff $XML_FILE_ONE'.sort' $XML_FILE_TWO'.sort' > 'xml_diff_by_tag.diff'
    #diff --side-by-side $XML_FILE_ONE'.sort' $XML_FILE_TWO'.sort' > 'xml_diff_by_tag.diff'
    #comm -3 $XML_FILE_ONE'.sort' $XML_FILE_TWO'.sort' > 'xml_diff_by_tag.comm'

    #show only differences per file
    sed -n -e 's/^<\ \(.*\)/\1/p' 'xml_diff_by_tag.diff' > $XML_FILE_ONE'.diff.uniq'
    sed -n -e 's/^>\ \(.*\)/\1/p' 'xml_diff_by_tag.diff' > $XML_FILE_TWO'.diff.uniq'

    #sed -n -e 's/^<\(.*\)/<\1/p' 'xml_diff_by_tag.comm' > $XML_FILE_ONE'.comm.uniq'
    #sed -n -e 's/\t<\(.*\)/<\1/p' 'xml_diff_by_tag.comm' > $XML_FILE_TWO'.comm.uniq'

    #removing unused files
    rm -fr $XML_FILE_ONE'.sed' $XML_FILE_TWO'.sed' $XML_FILE_ONE'.sort' $XML_FILE_TWO'.sort'
  else
    echo 'Invalid arguments provided'
    echo 'try '$0' $xmlFileOne $xmlFileTwo $comparingTag'
  fi
else
  echo 'Invalid number of arguments provided'
  echo 'try '$0' $xmlFileOne $xmlFileTwo $comparingTag'
fi

Available on github.com.

Trackbacks

Keine Trackbacks

Kommentare

Ansicht der Kommentare: Linear | Verschachtelt

Noch keine Kommentare

Kommentar schreiben

Die angegebene E-Mail-Adresse wird nicht dargestellt, sondern nur für eventuelle Benachrichtigungen verwendet.
Um einen Kommentar hinterlassen zu können, erhalten Sie nach dem Kommentieren eine E-Mail mit Aktivierungslink an ihre angegebene Adresse.

Um maschinelle und automatische Übertragung von Spamkommentaren zu verhindern, bitte die Zeichenfolge im dargestellten Bild in der Eingabemaske eintragen. Nur wenn die Zeichenfolge richtig eingegeben wurde, kann der Kommentar angenommen werden. Bitte beachten Sie, dass Ihr Browser Cookies unterstützen muss, um dieses Verfahren anzuwenden.
CAPTCHA

Markdown-Formatierung erlaubt
Formular-Optionen