XuvTools is developed in a cooperative effort by:
  • Chair of Pattern Recognition and Image Processing [www]
  • Friedrich Miescher Institute for Biomedical Research [www]
  • Center for Biological Systems Analysis [www]
 

Accessing the Server Side Data Set Repository

File Formats and Folder Structure

The folder structure of the data set repository is depicted here:

/data/xuvtools_img/BugListData    - data sets belonging to a trac bugreport
/data/xuvtools_img/FileFormats    - data sets and documents for file readers
/data/xuvtools_img/ReferenceData  - data sets that are publicly available for testing
/data/xuvtools_img/OriginalData   - all other data sets (originals) that are available

The files are stored in a specific format on the server, to reduce the likelihood of corruption and to consume as little space as possible. The format is rar, created with the proprietary rar tools from http://www.rarlab.com/download.htm. The proprietary rar format was chosen because it a) supports recovery information, b) compresses well and c) is available for all platforms we support. To install rar on Windows, simply download and install the latest nagware version. To install rar on Debian/Ubuntu (Ubuntu needs the multiverse repository), do:

sudo aptitude install rar unrar

Useful rar flags explained

    a      compress file(s)
   -m5     use compression strength "5"
   -rr1    add 1 percent recovery information
    x      extract file(s)

rar examples

To compress a data set with recovery information, do:

rar a -m5 -rr1 "<filename>.rar" "<filename>"

To uncompress a data set with recovery information into the current directory, do:

unrar x "<filename>.rar"

If you need to compress a directory of files, you can use a loop like the following:

find "${PWD}" -type f|grep -v '\.rar$\|\.md5$'|while read FILE ; do
    J=$(dirname "$FILE") ; K=$(basename "$FILE") ; cd "$J" && if ! test -f "$K.rar" ; then rar a -m5 -rr1 "$K.rar" "$K" || break ; fi
done

If you need to uncompress the full repository of files, you can use a loop like the following:

for DIR in BugListData FileFormats ReferenceData OriginalData ; do
    find "/data/xuvtools_img/$DIR" -type f -name \*\.rar|while read FILE ; do
        J=$(dirname "$FILE") ; K=$(basename "$FILE" ".rar") ; if ! test -f "$J/$K" ; then cd "$J" && unrar x -o- "$K.rar" ; fi
    done
done

md5sum examples

We also provide MD5 checksums of the original files (not the compressed rar files) along with the compressed data sets, so that each file can be verified after extraction.

To create a new md5 checksum file, pipe the result of the md5sum tool into a file:

md5sum --binary "<filename>" > "<filename>.md5"

To check an extracted file against its md5 checksum file, use the md5sum tool in check mode:

md5sum --check "<filename>.md5"
<filename>: OK
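The create/check pair can be exercised end-to-end on a throwaway file. The file name below is hypothetical, purely for illustration:

```shell
cd /tmp
# Stand-in for a real data set file.
echo "demo content" > demo-dataset.tif
# Create the checksum file next to the data file.
md5sum --binary "demo-dataset.tif" > "demo-dataset.tif.md5"
# Verify the file; prints "demo-dataset.tif: OK" on success.
md5sum --check "demo-dataset.tif.md5"
```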

If you need to add md5sums to a directory of files, you can use a loop like the following:

find "${PWD}" -type f|grep -v '\.rar$\|\.md5$'|while read FILE ; do
    J=$(dirname "$FILE") ; K=$(basename "$FILE") ; cd "$J" && if ! test -f "$K.md5" ; then md5sum --binary "$K" > "$K.md5" || break ; fi
done

FIXME add Windows instructions.

rar with md5sum combined examples

Here is a combined example that creates compressed rar archives and md5sums for a directory of files. It performs better than running the two individual loops one after the other, because the files are likely still in the filesystem cache:

find "${PWD}" -type f|while read FILE ; do
    J=$(dirname "$FILE") ; K=$(basename "$FILE") ; cd "$J" && rar a -m5 -rr1 "$K.rar" "$K" && md5sum --binary "$K" > "$K.md5" || break
done

Useful rsync flags explained

   --dry-run             don't change any files, just print what would be done
   --verbose             print what is being done
   --progress            show a progress meter for each transfer
   --archive             archive mode: recurse and preserve permissions, times, owner, links
   --recursive           go through subdirectories
   --links               transfer links
   --devices             transfer device nodes
   --specials            preserve special files
   --rsh='ssh -p 22022'  use the ssh port 22022
   --compress            compress before sending/receiving
   --include "P"         include all files matching pattern P
   --exclude "P"         exclude all files matching pattern P
   --prune-empty-dirs    do not create empty directories
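The include/exclude interplay used in the upload commands below can be tried locally. rsync applies the first matching rule per file: --include "*/" keeps directories so rsync can recurse, the *.rar and *.md5 includes keep the payload, and the final --exclude "*" drops everything else. A minimal local sketch (paths are hypothetical, for illustration only):

```shell
# Build a small demo tree.
mkdir -p /tmp/rsync-demo/src/sub /tmp/rsync-demo/dst
touch /tmp/rsync-demo/src/a.rar /tmp/rsync-demo/src/a.md5 \
      /tmp/rsync-demo/src/notes.txt /tmp/rsync-demo/src/sub/b.rar

# First matching rule wins: keep directories, keep *.rar and *.md5,
# exclude everything else.
rsync --recursive --prune-empty-dirs --include "*/" \
    --include "*.rar" --include "*.md5" --exclude "*" \
    "/tmp/rsync-demo/src/" "/tmp/rsync-demo/dst/"
```

Afterwards the destination contains a.rar, a.md5 and sub/b.rar, but not notes.txt.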

Synchronize Data Set Repository: From Server to Local

Here is an rsync call that incrementally downloads the parts of the BugListData repository that are missing locally:

mkdir -p /data/xuvtools_img
rsync --archive --verbose --progress --rsh='ssh -p 22022' \
    "xuvtools.org:/data/xuvtools_img/BugListData" \
    "/data/xuvtools_img/"

If you want to mirror/update all repositories, you can synchronize each of the folders listed above, BugListData, FileFormats, ReferenceData and OriginalData, individually:

mkdir -p /data/xuvtools_img
for DIR in BugListData FileFormats ReferenceData OriginalData ; do
  rsync --archive --verbose --progress --rsh='ssh -p 22022' \
      "xuvtools.org:/data/xuvtools_img/${DIR}" \
      "/data/xuvtools_img/" || \
      break
done

Of course, you can just use WinSCP, scp or unison as well.

FIXME Add unison instructions

Synchronize Data Set Repository: From Local to Server

Please upload only rar-compressed files with recovery information to the server. The following rsync calls transfer only files matching *.rar and *.md5:

rsync --recursive --links --devices --specials --verbose \
    --progress --rsh='ssh -p 22022' --prune-empty-dirs --include "*/" \
    --include "*.rar" --include "*.md5" --exclude "*" \
    "/data/xuvtools_img/FileFormats" \
    "xuvtools.org:/data/xuvtools_img/"

To upload all repositories, loop over the folders individually:

for DIR in BugListData FileFormats ReferenceData OriginalData ; do
  rsync --recursive --links --devices --specials --verbose \
      --progress --rsh='ssh -p 22022' --prune-empty-dirs --include "*/" \
      --include "*.rar" --include "*.md5" --exclude "*" \
      "/data/xuvtools_img/${DIR}" \
      "xuvtools.org:/data/xuvtools_img/" || \
      break
done

Of course, you can just use WinSCP, scp or unison as well.

Fixing Permissions

If you work on Unix, you might want to synchronize with the correct user names and permissions, so that everyone accessing the server has the correct access rights. For that, it can be helpful to add a user account www-xuvtools on your local machine and become a member of its group:

sudo addgroup --gid 1020 www-xuvtools
sudo adduser --uid 1020 --gid 1020 www-xuvtools
sudo adduser ${USER} www-xuvtools
# you may need to log out, and back in, for the group addition to become effective

for DIR in BugListData FileFormats ReferenceData OriginalData ; do
  sudo mkdir -p "/data/xuvtools_img/${DIR}" && \
  sudo find "/data/xuvtools_img/${DIR}" -type d -exec chmod 770 {} \; && \
  sudo find "/data/xuvtools_img/${DIR}" -type f -exec chmod 660 {} \; && \
  sudo chown -R www-xuvtools:www-xuvtools "/data/xuvtools_img/${DIR}" || \
  break
done
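The 770/660 split applied above can be verified with stat. Here is a local sketch on a throwaway tree (hypothetical paths, no sudo needed):

```shell
# Create a throwaway directory and file.
mkdir -p /tmp/perm-demo/dir
touch /tmp/perm-demo/dir/file

# Apply the same scheme as above: 770 for directories, 660 for files.
find /tmp/perm-demo -type d -exec chmod 770 {} \;
find /tmp/perm-demo -type f -exec chmod 660 {} \;

# Print octal mode and name; directories show 770, the file shows 660.
stat -c '%a %n' /tmp/perm-demo/dir /tmp/perm-demo/dir/file
```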

Finding Duplicate Data Sets (based on MD5)

We sometimes find duplicate data sets in the upload folder. To find duplicate entries, you can use the following script lines:

find BugListData FileFormats OriginalData ReferenceData -name \*.md5 \
    |while read I ; do
        J=$(cut -d' ' -f1 "$I")
        K=$(echo "$I"|perl -pe 's/\.md5$//g')
        echo "$J:$K"
     done | sort > /tmp/server-side-data-sorted.txt
cat /tmp/server-side-data-sorted.txt \
    |while read I ; do
         MD5=$(echo "$I"|cut -d':' -f1)
         NAME=$(echo "$I"|cut -d':' -f2-)
         COUNT=$(grep -c "$MD5" /tmp/server-side-data-sorted.txt)
         if test "$COUNT" -gt 1 ; then grep "$MD5" /tmp/server-side-data-sorted.txt ; fi
     done
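Once the hash:path list exists, the duplicate groups can also be extracted more directly with uniq -d; a sketch, assuming the list format written above (demonstrated here on a small hand-made list with placeholder hashes):

```shell
# Stand-in for /tmp/server-side-data-sorted.txt (hash:path, one per line).
cat > /tmp/demo-sorted.txt <<'EOF'
aaaa:BugListData/set1-copy.lsm
aaaa:OriginalData/set1.lsm
bbbb:ReferenceData/set2.lsm
EOF

# uniq -d prints each hash that occurs more than once, exactly once;
# grep then lists all files sharing that hash.
cut -d':' -f1 /tmp/demo-sorted.txt | sort | uniq -d | while read MD5 ; do
    grep "^${MD5}:" /tmp/demo-sorted.txt
done
```

Unlike the loop above, which repeats a duplicate group once per member, this prints each group only once.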

Other Data Repositories

server_datasets.txt · Last modified: 2010/09/19 13:40 by mario
Contact: admin(a)xuvtools.org