Analytics

From Wikitech

Analytics is the systematic computational analysis of data or statistics, for the purposes of discovery, interpretation, and communication of meaningful patterns.

In the context of the Wikimedia Foundation, the term Analytics generally refers to work carried out on the Analytics Cluster and the Data Lake by various WMF staff and volunteers.

The Data Platform Engineering team is responsible for managing the Analytics Cluster and the Data Lake, so most pages under /Analytics are now of historical interest only.

Analytics Cluster

The Analytics Cluster comprises a number of systems that help researchers, data scientists, machine learning engineers, and other authorized parties access the Data Lake.

If you believe that you need access to the cluster, please refer to Data Platform/Data access.

Data Lake

The term Data Lake refers to the set of data files (also referred to as datasets) that are stored in HDFS, the Hadoop Distributed File System.

Many of these datasets are managed by the Data Platform Engineering team, using pipelines that are deployed to production and monitored.

However, members of the analytics-privatedata-users group may also create their own data files in Hadoop, which allows them to define custom Hive tables and to work with the data from tools such as Jupyter and Spark.
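
As a rough illustration of what defining a custom Hive table over your own data files can look like, the sketch below builds a HiveQL `CREATE EXTERNAL TABLE` statement in Python. It is only a minimal example under stated assumptions: the helper `build_create_table`, the database `example_user`, the table `page_counts`, its columns, and the HDFS path are all hypothetical and do not refer to real objects on the cluster.

```python
def build_create_table(database, table, columns, location):
    """Return a HiveQL statement declaring an external Parquet table.

    `columns` is a list of (column_name, hive_type) pairs.
    All names passed in here are illustrative, not real cluster objects.
    """
    column_defs = ",\n  ".join(
        f"`{name}` {hive_type}" for name, hive_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"  {column_defs}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Hypothetical database, table, and HDFS path, used only for illustration.
statement = build_create_table(
    database="example_user",
    table="page_counts",
    columns=[("page_title", "STRING"), ("view_count", "BIGINT")],
    location="/user/example_user/page_counts",
)
print(statement)
```

The resulting statement could then be submitted through a Hive client, or from a Jupyter notebook with something like `spark.sql(statement)`, assuming a SparkSession named `spark` is available there. An external table is used in this sketch so that dropping the table would leave the underlying data files in place.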

See also