
A way to deal with Schei* encoding - deal with "Non-ISO extended-ASCII"

We had, again, some issues with encoding.
*file* reported the file type as "Non-ISO extended-ASCII". This time, I wrote down a basic sequence of steps here.
In the end, it really is a brute-force approach, and it relies heavily on open source software (thanks again, dudes!). The steps are based on this post from superuser.com.

# create a list with supported encodings
iconv --list | sed 's/\/\/$//' | sort > list_with_supported_encodings.txt
# iterate over the list of known encodings and try to convert the file from each of them
LOCAL_SUPPORTED_ENCODING_FILE_PATH='list_with_supported_encodings.txt'
LOCAL_RESULT_FILE_PATH='result.txt'

for LOCAL_ENCODING in $(cat "$LOCAL_SUPPORTED_ENCODING_FILE_PATH"); do
    # use a fixed format string so that a "%" in a name cannot break printf
    printf '%s  ' "$LOCAL_ENCODING"
    iconv -f "$LOCAL_ENCODING" -t UTF-8 2016-02-08_UPLOAD_CSV.csv.stev > /dev/null 2>&1 && echo "ok: $LOCAL_ENCODING" || echo "fail: $LOCAL_ENCODING"
# uncomment line below if you want to see the result and put it into the file
#done | tee "$LOCAL_RESULT_FILE_PATH"
# put the output into the file
done > "$LOCAL_RESULT_FILE_PATH"
# keep only the successful attempts
LOCAL_RESULT_FILE_PATH='result.txt'

grep 'ok:' "$LOCAL_RESULT_FILE_PATH" > "only_ok_$LOCAL_RESULT_FILE_PATH"
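Expect a long list of "ok" lines. Most single-byte encodings assign a character to nearly every byte value, so iconv succeeds without complaint even when the encoding is wrong. Here is a small sketch of that effect. The helper name decode_sample_byte is made up for this illustration, and it assumes your iconv knows both ISO-8859-1 and CP850.

```shell
# Hypothetical helper: decode the single byte 0xE4 (octal 344)
# as if it were written in the given encoding.
decode_sample_byte() {
    printf '\344\n' | iconv -f "$1" -t UTF-8
}

# Both calls succeed, but the same byte turns into a different character.
decode_sample_byte ISO-8859-1   # prints "ä"
decode_sample_byte CP850        # prints "õ"
```

So an "ok" only tells you that the conversion did not fail, not that the result is readable.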
Now comes the hard work: you have to try each "ok" encoding from the result file.
# read the result file with the ok content and create a converted version of your broken file
LOCAL_BROKEN_FILE_PATH='relative/or/full/qualified/file/name.txt'
LOCAL_RESULT_FILE_PATH='only_ok_result.txt'

# sed -e 's/  ok:.*$//' means
# remove everything starting with '  ok:' on each line
# assuming a line looks like "WS2  ok: WS2", the result will look like "WS2"

for LOCAL_ENCODING in $(grep 'ok:' "$LOCAL_RESULT_FILE_PATH" | sed -e 's/  ok:.*$//' | sort -u); do
    # encoding names may contain "/", so replace it before building the file name,
    # and put the prefix in front of the file name, not in front of the directory
    LOCAL_CONVERTED_FILE_PATH="$(dirname "$LOCAL_BROKEN_FILE_PATH")/$(printf '%s' "$LOCAL_ENCODING" | tr '/' '_')_$(basename "$LOCAL_BROKEN_FILE_PATH")"
    #echo $LOCAL_CONVERTED_FILE_PATH
    # convert from the current candidate encoding, not from a fixed one
    iconv -f "$LOCAL_ENCODING" -t UTF-8 "$LOCAL_BROKEN_FILE_PATH" > "$LOCAL_CONVERTED_FILE_PATH"
done
Open each file and check whether the special characters you expect look right. "WINDOWS-1258" and "CP850" are good blind guesses here.
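If you already know a word that must appear in the file (a name with an umlaut, for example), you can let grep do part of the checking. The following is only a sketch. The function name find_matching_encodings and its arguments are my own invention, not part of iconv, and it assumes that your terminal and the expected text are UTF-8.

```shell
# Hypothetical helper: print every candidate encoding under which the
# broken file, converted to UTF-8, contains the expected text.
# Usage: find_matching_encodings FILE EXPECTED_TEXT ENCODING...
find_matching_encodings() {
    broken_file=$1
    expected_text=$2
    shift 2
    for candidate_encoding in "$@"; do
        # grep -F: treat the expected text as a fixed string
        # grep -q: print nothing, only report match / no match
        if iconv -f "$candidate_encoding" -t UTF-8 "$broken_file" 2>/dev/null \
            | grep -qF -- "$expected_text"; then
            echo "$candidate_encoding"
        fi
    done
}

# Example call (file name and search word are placeholders):
# find_matching_encodings name.txt 'Grüße' WINDOWS-1258 CP850
```

This only narrows down the list. Several encodings can still produce the same word, so a final look at the converted file is still needed.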
