Help:WordToWiki

From Meta, a Wikimedia project coordination wiki

Here are some tools that may be helpful in converting Microsoft Word files to wiki markup:

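The conversion can also be done in a two-step process using Perl: doc2mw converts the Word file to HTML with wvHtml, and html2mw converts that HTML to wiki markup. This requires wvHtml (part of the wv package), Perl, and the HTML::WikiConverter module with its MediaWiki dialect.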

doc2mw:

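 #!/bin/bash
 #       doc2mw - Word to MediaWiki converter
 #       Usage: doc2mw file.doc > file.wiki
 
 FILE="$1"
 if [ -z "${FILE}" ]; then
         echo "Usage: $0 file.doc" >&2
         exit 1
 fi
 
 # Name the temporary file after the input's base name so that
 # input paths containing slashes still work.
 TMP="$$-$(basename "${FILE}")"
 
 # Prefer an html2mw in the current directory, otherwise search PATH
 if [ -x "./html2mw" ]; then
         HTML2MW='./html2mw'
 else
         HTML2MW='html2mw'
 fi
 
 # Convert the Word document to HTML in /tmp
 wvHtml --targetdir=/tmp "${FILE}" "${TMP}" || exit 1
 
 # Remove extra divs (both opening and closing tags)
 perl -pi -e 's{</?div\b[^>]*>}{}gi' "/tmp/${TMP}"
 
 "${HTML2MW}" "/tmp/${TMP}"
 rm -f "/tmp/${TMP}"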

html2mw:

 #!/usr/bin/perl
 #       html2mw - HTML to MediaWiki converter
 
 use strict;
 use warnings;
 use HTML::WikiConverter;
 
 # Read all of the input (files named on the command line, or STDIN)
 my $html = '';
 while (<>) { $html .= $_; }
 
 my $w = HTML::WikiConverter->new( dialect => 'MediaWiki' );
 
 my $p = $w->html2wiki($html);
 
 # Substitutions to get rid of nasty things we don't need
 $p =~ s/<br \/>//g;
 $p =~ s/&nbsp;//g;
 print $p;
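The two clean-up substitutions at the end of html2mw delete every `<br />` and `&nbsp;` from the converted text. The effect can be tried without installing the Perl module by using sed. This is only a sketch of the same idea: the function name clean_wiki and the sample line are hypothetical.

```shell
#!/bin/bash
# clean_wiki: read wiki text on standard input and delete
# leftover <br /> tags and &nbsp; entities.
# (Hypothetical helper name, for illustration only.)
clean_wiki() {
    sed -e 's#<br />##g' -e 's/&nbsp;//g'
}

printf '%s\n' 'Line one<br />&nbsp;Line two' | clean_wiki
# prints: Line oneLine two
```

Because the matches are deleted rather than replaced with a space, text on either side of a removed `<br />` or `&nbsp;` is joined together, as the sample output shows.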

Disclaimer: These scripts are probably not the best way to do this, only one possible way. Please feel free to improve them.

All in one version

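 #!/usr/bin/perl -w
 
 use strict;
 use HTML::WikiConverter;
 
 my $word_doc = shift or die "Usage: $0 file.doc\n";
 my $wvHtml = "/usr/bin/wvHtml";
 
 # Run wvHtml with "-" as the output name and read its HTML from a pipe.
 # The list form of open avoids shell quoting problems with file names
 # that contain spaces.
 open( my $pipe, '-|', $wvHtml, $word_doc, '-' )
     or die "Cannot run $wvHtml: $!\n";
 my $html_stage1 = "";
 while( <$pipe> ) {
     $html_stage1 .= $_;
 }
 close( $pipe ) or warn "$wvHtml exited abnormally\n";
 
 $html_stage1 =~ s{</?div\b[^>]*>}{}gi; # Remove extra divs
 my $w = HTML::WikiConverter->new( dialect => 'MediaWiki' );
 my $p = $w->html2wiki( $html_stage1 );
 
 # Clean up $p...
 $p =~ s/<br \/>//g;
 $p =~ s/&nbsp;//g;
 print $p;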
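The scripts on this page convert one file at a time and print the wiki markup to standard output. A minimal shell sketch for converting every .doc file in a directory follows. The function name convert_all and the .wiki output extension are hypothetical, and the sketch assumes the converter takes a file name as its only argument, as doc2mw does.

```shell
#!/bin/bash
# convert_all DIR [CONVERTER]
# Run CONVERTER (default: doc2mw) on every .doc file in DIR and save
# the output next to it with a .wiki extension.
# (Hypothetical helper; the converter is assumed to write to stdout.)
convert_all() {
    local dir="$1"
    local converter="${2:-doc2mw}"
    local doc
    for doc in "$dir"/*.doc; do
        # Skip the unexpanded pattern when no .doc files exist
        [ -e "$doc" ] || continue
        "$converter" "$doc" > "${doc%.doc}.wiki"
    done
}

# Example call (not run here):
#   convert_all ./documents
```

Passing the converter as a parameter means the same loop can be used with doc2mw or with the all-in-one script, whatever name it is saved under.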