
Wikipedia:Reference desk/Archives/Computing/2015 November 16

From Wikipedia, the free encyclopedia
Computing desk


November 16

What makes cosine similarity useful for classifying documents?

Why use cosine similarity to measure how similar two documents are? That is, (A · B) / (||A|| * ||B||), where A and B are vectors of word frequencies.

Couldn't you just make a table of words + frequencies for each document and subtract the values for doc B from the values for doc A? Larger differences would imply that the documents are more different overall. That is, A - B, where A and B are vectors of word frequencies. --3dcaddy (talk) 19:01, 16 November 2015 (UTC)

You need some historical background for a complete answer. The Jaccard index (also known as the Jaccard coefficient) was popular for measuring similarity and diversity between two sets. Flowers are often used as an example for the Jaccard index. My set might contain color and length of stem. Yours might have color and number of petals. We compare based on what we both have and get a measure of similarity. It is also important to note that the Jaccard index handles sparsity very well. If I forgot to write down color for half the flowers in my set, it still works for those that actually have data. Next: what if we are working strictly with binary data? Every field is a yes/no answer. It is still two sets containing different columns and there is still a lot of sparsity. The Tanimoto coefficient is an algebraic form of the set-theoretic Jaccard index for binary sets. It is popular and common. But it has some overhead that can be simplified. If you know the algebraic form of cosine, you will see that it is very similar to the Tanimoto formula. So, why not throw out the complexity and use cosine instead? You nearly get the Tanimoto coefficient, which is the same as the Jaccard index for binary sets. The Jaccard index is already accepted, so using cosine is nearly accepted. That is the history. Why do we not just count the differences? If I say the differences count up to 132, what does that mean? Nothing really. You need to confine the answer to a range so I know the minimum and maximum values. We know that cosine ranges from -1 to 1. If you tell me the similarity is 1, that is the max value. I only have to ask if the 1 means absolutely different or absolutely the same. By convention, 1 means absolutely the same and 0 means completely different (-1 means the exact opposite, but that makes no sense in most examples; word frequencies are never negative, so for documents the value stays between 0 and 1). If that doesn't answer the question, please ask for what I missed. I don't want to flood you with even more details that you don't think are pertinent.
162.211.46.242 (talk) 20:31, 16 November 2015 (UTC)
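The range argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the three-word vocabulary, and the word counts are all made up for the example and do not come from the thread.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def summed_difference(a, b):
    """Sum of absolute per-word count differences (the 'A - B' idea)."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical word counts over the vocabulary ["cat", "dog", "fish"].
doc_short = [2, 1, 0]
doc_long = [20, 10, 0]   # same word proportions, ten times longer
doc_other = [0, 0, 3]    # shares no words with the other two

print(cosine_similarity(doc_short, doc_long))    # very close to 1
print(cosine_similarity(doc_short, doc_other))   # 0.0
print(summed_difference(doc_short, doc_long))    # 27
print(summed_difference(doc_short, doc_other))   # 6
```

With these toy numbers the raw difference rates the longer document with identical word proportions (27) as more different than a document that shares no words at all (6), and neither number has an obvious scale. The cosine values are bounded and, in this case, order the documents the way one would expect.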
The short version: Each document is modeled as a vector in a high-dimensional vector space. When you have two vectors that point in nearly the same direction, they are more "similar" than vectors that are orthogonal or antiparallel, or just point in rather different directions. See also document modelling. (or maybe not, that's a rather pitiful stub) SemanticMantis (talk) 21:33, 16 November 2015 (UTC)
Thanks for the answers, now I get it. I didn't realize that counting the number of differences, instead of finding a range of values, was quite off the track. Is there any literature about ideas like this that make intuitive sense but are mathematically no good? --3dcaddy (talk) 21:42, 16 November 2015 (UTC)
SemanticMantis is correct. Cosine similarity is strictly the cosine of the angle between the two vectors. The relative vector lengths are ignored. If you analyze the Tanimoto formula, you will recognize that it takes into account the distance between the end points of the vectors. So, if the angle is 10 degrees, cosine is always the cosine of 10 degrees. With Tanimoto, if the vector lengths are nearly the same, the distance between the end points will be near minimal and give a higher similarity value. If one vector is much longer than the other, the angle remains the same, but the distance between the end points is much longer and the similarity is reduced. If the cosine similarity is good enough for you, then use it. If it turns out that some vectors are extremely long and others are extremely short, you will want to use the Tanimoto coefficient to account for that aspect of measuring similarity. Then, if you are still refining your algorithms, you can look into SVD (which I personally don't like) or convert your data into ordered strings and make a big jump into string-based similarity. 209.149.115.177 (talk) 14:50, 17 November 2015 (UTC)
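The point about vector length can be checked with a short Python sketch. It assumes the common vector form of the Tanimoto coefficient, A·B / (|A|² + |B|² - A·B); the helper names and the sample vectors are invented for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; vector lengths cancel out."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def tanimoto(a, b):
    """Tanimoto coefficient in its vector form: A.B / (|A|^2 + |B|^2 - A.B)."""
    ab = dot(a, b)
    return ab / (dot(a, a) + dot(b, b) - ab)

a = [1, 2, 0]
b_long = [3, 6, 0]   # same direction as a, three times as long

print(cosine_similarity(a, b_long))  # very close to 1: the angle is zero
print(tanimoto(a, b_long))           # 15 / 35, roughly 0.43
print(tanimoto(a, a))                # 1.0: identical vectors
```

Both pairs have a zero angle between them, so cosine cannot tell them apart, while the Tanimoto value drops once one vector is much longer than the other.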

Super Mario Maker

After watching YouTube videos of Super Mario Maker, I've become interested in actually buying a Wii U just to play it. It would be the first video game made in the last two decades that I would actually consider buying.

However, I'm afraid I've fallen hopelessly out of touch with modern video games. The last time I bought new video games, they were on real, physical, honest-to-gosh floppy disks put inside a nice, beautiful cardboard box I could take home, open up, and put into my computer.

I figure that these days, actual physical storage media are like, "Soooo last millennium! What are you, Methuselah?" So, how would I actually go about buying and installing the game? JIP | Talk 19:25, 16 November 2015 (UTC)

You can either buy a physical Wii U Optical Disc retail, or you can download from the Nintendo eShop. -- Finlay McWalterTalk 19:34, 16 November 2015 (UTC)
In case a physical disc isn't available, how would I go about downloading it? Can I do it solely using the Wii U? I already have a wired (i.e. non-wireless) broadband Internet connection, which I'm using right now to write this message. Can I use that on the Wii U? Does it have an Ethernet cable connection? How do I pay for it? Can I just use my credit card or do I have to set up some sort of new-fangled subscription account? JIP | Talk 19:40, 16 November 2015 (UTC)
The Wii U doesn't have an ethernet port, and most people use its Wifi capability; if you can't do that, you can buy a USB Wii LAN adapter which plugs into the Wii U's ethernet port. You can use a credit card with the eShop (and maybe a debit card); you can also buy physical gift cards (they're just plastic cards with numbers on them) in supermarkets which give eShop credit. -- Finlay McWalterTalk 20:05, 16 November 2015 (UTC)
You said both "the Wii U doesn't have an ethernet port" and "plugs into the Wii U's ethernet port". Which is it? Or did you mean "the Wii U's USB port"? In any case, I might be better off finally buying and installing a WiFi device in my apartment. So far I've had no use for one as the only Internet-capable device I ever use is my computer, which uses the wired Internet connection, which I presume is both faster and more secure than a WiFi connection. JIP | Talk 20:23, 16 November 2015 (UTC)
Oops, yes, I meant USB port. It's just a usb<->ethernet dongle. I see them for about £10 on Amazon. -- Finlay McWalterTalk 20:29, 16 November 2015 (UTC)
Furthermore, once I get it, I'd like to play other people's levels. Can these be downloaded free of charge or are there additional charges for them? JIP | Talk 19:42, 16 November 2015 (UTC)
As I understand it, when someone is finished designing a level they tell others the level's ID code (where the ID is a 16-hex-digit number). Here's an example of people posting their IDs in the SuperMarioMaker reddit: https://www.reddit.com/r/MarioMaker/comments/3t164h/level_of_the_week_8_factory_submissions_last/ -- Finlay McWalterTalk 20:09, 16 November 2015 (UTC)
(edit conflict) There's also the 10 Mario/100 Mario Challenge which chooses several levels at random from the ones users have uploaded, and you have either 10 or 100 lives to get through them all (or skip ones that might be very difficult). I believe that after playing each level it saves snapshots of those levels in your game so that you can look at them in the level designer afterward. FrameDrag (talk) 20:19, 16 November 2015 (UTC)
You might appreciate SuperBunnyHop's review of a bunch of SMM levels submitted to him here. I think it gives a reasonable idea about what is, and crucially what isn't, possible in SMM. As constructing-stuff-in-the-game type games go, it looks to be considerably inferior to Little Big Planet and especially Minecraft. -- Finlay McWalterTalk 20:13, 16 November 2015 (UTC)
Wii U is intended to be easy and idiot proof, and it succeeds admirably at those goals. You will have no problems buying (officially licensed) games as physical media or online. For inspiration on SMM, see Bananasaurus Rex's playthrough of this insane level [1] :) SemanticMantis (talk) 21:30, 16 November 2015 (UTC)