COURSES: A MOOC Search Engine
Ruilin Xu
[email protected]

Abstract

With hundreds of thousands of massive open online courses (MOOCs) available on the internet, and more appearing every day, it is cumbersome for users to efficiently find the specific courses they want. People usually need to repeat the same search on multiple MOOC websites, such as Coursera [4] and Udacity [5], in order to find what they are looking for. COURSES was developed to address this problem. COURSES is a MOOC search engine that gathers around 4,000 courses from major MOOC websites, including Coursera [4], Udacity [5], Khan Academy [6], Udemy [7], and Edx [8]. COURSES also implements faceted search: all search results are organized along multiple facets, such as course price, length, workload, instructor, and course category. By combining a search query with multiple filters, users can target specific courses very efficiently.

Introduction

Nowadays, people love to enroll in massive open online courses (MOOCs) because of their convenience. As MOOCs become more and more popular, their number has increased dramatically, and numerous MOOC websites have emerged. Unfortunately, with the number of MOOC websites growing rapidly, it has become harder for users to find the courses they want. They often need to repeat their searches again and again on different MOOC websites, trying to find the perfect result. This is both cumbersome and inefficient. Is there a way to search only once and easily find the courses we want across all the MOOC websites? With that question in mind, and inspired by Assignment 3, I realized the idea was feasible. After discussion with the professor, I decided to create COURSES, a vertical MOOC search engine powered by Apache Solr, to solve this problem and to apply what I learned in class to real life. This search engine greatly simplifies the search for suitable online courses because it combines courses from five sites (Coursera [4], Udacity [5], Khan Academy [6], Udemy [7], and Edx [8]), eliminating the need for users to hop from site to site. Users can also filter results based on what they are looking for. There are similar websites, such as RedHoop [1], MOOC-List [2], and Class-Central [3], but they all have downsides: they either do not incorporate as much data as COURSES or do not display as much information about each course as the user needs. They only display the title of a course and a brief course introduction; the user cannot see any other attributes of the course, such as its instructors, its price, or its length. As a result, COURSES is both meaningful and useful. COURSES has a sizable database of around 4,000 courses. It also offers many useful faceted filters that let users target search results efficiently, and it displays all of the useful attributes of each course within the search results, so users can see everything in one place. This can save users a significant amount of time in deciding which courses to take.

Related Work

COURSES was inspired by Assignment 3 from our CS 410 course, which is essentially a tutorial on building a simple search engine from a few data sources. COURSES uses a similar technique when parsing data: both use a combination of Ruby and JavaScript files. The difference is that COURSES uses JavaScript parsers that are much more complicated than the one used in Assignment 3. COURSES' parsers are website-specific, meaning each one parses different data from a different MOOC website. After parsing, the data is converted to XML files that are compatible with Apache Solr.

There is another well-implemented MOOC search engine called RedHoop [1]. RedHoop and COURSES are very similar in that both are multi-site search engines, gathering course data from various MOOC websites, and both provide faceted search. However, COURSES improves greatly on the display of search results. RedHoop only displays the course title with a brief introduction, whereas COURSES displays much more useful information, such as price, course length, estimated workload, course language, and instructor information. COURSES also has more faceted filters, such as course language and instructor, which are very important for international users who want to take courses in their own language and for users who have a strong preference for certain instructors.


Problem Definition

The challenge I solved was to create an online search engine for courses that draws from multiple sources and helps users search more efficiently using different facets. The input is the user's search query plus any filters they want to apply; the expected output is a list of search results sorted by relevance. Building this search engine involved four sub-challenges, or stages, which I enumerate below.

1. Data Crawling & Parsing

Because I needed to aggregate data from various sources, the first problem was how best to crawl and parse it. Each website has its own data format, which may differ greatly from the others. For example, courses from Coursera [4] have no price field, since all of them are free; courses from Udacity [5], on the other hand, do have prices listed.
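This discrepancy can be handled at parse time. The sketch below is illustrative only (the function and field names are hypothetical, not COURSES' actual code); it treats a missing price field, or one containing "null" as on Udacity, as a free course:

```javascript
// Hypothetical sketch: normalize the price attribute across sites.
// A missing price field (Coursera) or a price containing "null"
// (Udacity's convention, per the parsing tables) means the course is free.
function priceOf(parsedCourse) {
  const p = parsedCourse.price;
  return (p === undefined || p.includes("null")) ? "FREE" : p;
}

const courseraCourse = { title: "Machine Learning" };            // no price field at all
const udacityCourse  = { title: "Intro to iOS", price: "$199" }; // price listed
```

Calling `priceOf(courseraCourse)` yields "FREE", while `priceOf(udacityCourse)` passes the listed price through unchanged.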

2. Data Processing & Consolidation

With all the data correctly parsed, another problem immediately emerged. Since I was building a single search engine that handles data from numerous websites, I needed to design a data structure that could easily hold and consolidate all of it. I had to decide which key attributes a course should have, so that the same structure could be applied to every website COURSES draws data from.

3. Data Formatting & Outputting

After consolidating the data, it was time to load it into Apache Solr. To do this, I needed a way to easily convert the raw data into a standard format that Apache Solr can read, such as XML or JSON files.

4. User Interface Design & Implementation

The last step, after inserting all the data into Apache Solr, was to design and implement a user interface that is sufficiently informative and displays all the important data users need in an aesthetically pleasing way.


Methods

In this section I describe in detail my solutions to the problems listed above. COURSES is based on Apache Solr, a great framework for building vertical search engines, although I personally found Solr not that easy to use, since it lacks detailed introductory documentation. To obtain the data I wanted, I wrote my own parsers, based on the crawler and parser from Assignment 3, and modified them to generate the data XML files that Solr can read. For the front-end user interface, I needed to make sure the data is displayed correctly and the interface is pleasant to the eye.

1. Data Crawling & Parsing

As mentioned above, since I needed to get data from many different MOOC websites, I designed a website-specific parser for each one, so that I could extract all the useful data correctly. By inspecting each website closely with the browser's developer tools, I came up with the following commands, which correctly parse the data from each website:

Coursera:
  Title:             document.title
  Website:           document.title.substring(document.title.indexOf("|")+2)
  Length:            document.body.getElementsByClassName("icon-calendar")[0].parentNode.childNodes[1].innerText
  Workload:          document.body.getElementsByClassName("icon-time")[0].parentNode.childNodes[1].innerText + document.body.getElementsByClassName("icon-time")[0].parentNode.childNodes[2].innerText
  Language:          document.body.getElementsByClassName("icon-globe")[0].parentNode.childNodes[1].innerText
  Instructor:        document.body.getElementsByClassName("coursera-course2-instructors-profile")[i].childNodes[2].childNodes[0].getElementsByTagName("span")[0].innerText (iterate i)
  Instructor intro:  document.body.getElementsByClassName("coursera-course2-instructors-profile")[i].childNodes[2].childNodes[1].getElementsByTagName("span")[0].innerText (iterate i)
  Course categories: document.body.getElementsByClassName("coursera-course-categories")[0].getElementsByTagName("a")[i].innerText (iterate i)
  Course intro:      document.body.getElementsByClassName("span6")[0].innerText
  Course body:       document.body.getElementsByClassName("span7")[0].innerText

Edx:
  Title:             document.title
  Website:           document.title.substring(document.title.indexOf("|")+2)
  Length:            document.body.getElementsByClassName("course-detail-length")[0].innerText.substring("Course Length: ".length)
  Workload:          document.body.getElementsByClassName("course-detail-effort")[0].innerText.substring("Estimated effort: ".length)
  Instructor:        document.body.getElementsByClassName("staff-list")[0].getElementsByTagName("li")[i].childNodes[3].childNodes[1].innerText (iterate i)
  Instructor intro:  document.body.getElementsByClassName("staff-list")[0].getElementsByTagName("li")[i].childNodes[3].childNodes[3].innerText (iterate i)
  Course intro:      document.body.getElementsByClassName("course-detail-subtitle copy-lead")[0].innerText
  Course body:       document.body.getElementsByClassName("course-section course-detail-about")[0].innerText + document.body.getElementsByClassName("view-display-id-errata")[0].innerText (second part might not exist)

Khan Academy:
  Title:             document.title
  Website:           document.title.substring(document.title.indexOf("|")+2)
  Course intro:      document.body.getElementsByClassName("topic-desc")[0].innerText
  Course body:       document.getElementById("page-container-inner").innerText

Udacity:
  Title:             document.title
  Website:           document.title.substring(document.title.indexOf("|")+2)
  Price:             document.body.getElementsByClassName("price-information")[0].innerText (if it contains "null", the course is free)
  Length:            document.body.getElementsByClassName("duration-information")[0].getElementsByClassName("col-md-10")[0].getElementsByTagName("strong")[0].innerText.substring("Approx. ".length)
  Workload:          document.body.getElementsByClassName("duration-information")[0].getElementsByClassName("col-md-10")[0].getElementsByTagName("small")[0].getElementsByTagName("p")[0].innerText.substring("Assumes ".length)
  Instructor:        document.body.getElementsByClassName("row row-gap-medium instructor-information-entry")[i].childNodes[2j-1].childNodes[1].getElementsByTagName("h3")[0].innerText (iterate i; j = 1, 2)
  Instructor intro:  document.body.getElementsByClassName("row row-gap-medium instructor-information-entry")[i].childNodes[2j-1].childNodes[3].getElementsByTagName("p")[0].innerText (iterate i; j = 1, 2)
  Course intro:      document.body.getElementsByClassName("col-md-8 col-md-offset-2")[1].getElementsByClassName("pretty-format")[0].innerText
  Course body:       document.body.getElementsByClassName("col-md-8 col-md-offset-2")[i].innerText (iterate i)

Udemy:
  Title:             document.title
  Website:           document.title.substring(document.title.indexOf("|")+2)
  Price:             document.body.getElementsByClassName("pb-p")[0].getElementsByClassName("pb-pr")[0].innerText
  Length:            document.body.getElementsByClassName("wi")[0].getElementsByClassName("wili")[1].innerText.replace(" of high quality content", "")
  Instructor:        document.body.getElementsByClassName("tbli")[i].childNodes[1].getElementsByClassName("tbr")[0].getElementsByTagName("a")[0].innerText (iterate i)
  Instructor intro:  document.body.getElementsByClassName("tbli")[i].childNodes[3].getElementsByTagName("p")[0].innerText (iterate i)
  Course intro:      document.body.getElementsByClassName("ci-d")[0].innerText
  Course body:       document.body.getElementsByClassName("mc")[0].innerText

2. Data Processing & Consolidation

With the above data parsed, I next designed a structure that can hold all attributes of a course and fit data from all of the websites. The following table is the result:

  Attribute           Coursera            Edx                 Khan Academy        Udacity             Udemy
  URL                 Given               Given               Given               Given               Given
  Title               Parsed              Parsed              Parsed              Parsed              Parsed
  Website             Parsed              Parsed              Parsed              Parsed              Parsed
  Price               DEFAULT: FREE       DEFAULT: FREE       DEFAULT: FREE       Parsed              Parsed
  Length              Parsed              Parsed              DEFAULT: Undefined  Parsed              Parsed
  Workload            Parsed              Parsed              DEFAULT: Undefined  Parsed              DEFAULT: Undefined
  Language            Parsed              DEFAULT: Undefined  DEFAULT: Undefined  DEFAULT: Undefined  DEFAULT: Undefined
  Instructor          Parsed              Parsed              DEFAULT: Undefined  Parsed              Parsed
  Instructor intro    Parsed              Parsed              DEFAULT: Undefined  Parsed              Parsed
  Course categories   Parsed              DEFAULT: Undefined  DEFAULT: Undefined  DEFAULT: Undefined  DEFAULT: Undefined
  Course intro        Parsed              Parsed              Parsed              Parsed              Parsed
  Course body         Parsed              Parsed              Parsed              Parsed              Parsed
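The consolidation rule above can be expressed compactly as code. The sketch below is illustrative only (the attribute names and helper are hypothetical, not COURSES' actual implementation): every attribute a site's parser did not produce falls back to its default, "FREE" for price and "Undefined" for everything else:

```javascript
// Hypothetical sketch of the consolidation table as data: one common
// schema for all five sites, with per-attribute defaults applied to
// whatever a site's parser could not provide.
const SCHEMA = ["url", "title", "website", "price", "length", "workload",
                "language", "instructor", "instructorIntro",
                "courseCategories", "courseIntro", "courseBody"];

const DEFAULTS = { price: "FREE" };   // all other attributes default to "Undefined"

function consolidate(parsed) {
  const course = {};
  for (const attr of SCHEMA) {
    course[attr] = parsed[attr] !== undefined
      ? parsed[attr]
      : (DEFAULTS[attr] || "Undefined");
  }
  return course;
}

// Khan Academy parses only title, website, course intro, and course body,
// so every other attribute falls back to its default:
const khan = consolidate({ url: "khan-url", title: "Algebra",
                           website: "Khan Academy",
                           courseIntro: "intro", courseBody: "body" });
```

With this helper, every course record has the same twelve attributes regardless of which site it came from, which is what makes the faceted filters uniform across sources.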

3. Data Formatting & Outputting

Apache Solr has its own rules for data files; I chose to use its XML format. With the consolidated data above, I was then able to create the data XML files (one file per website). After trying all kinds of options, I finally decided to output the data while reading it in and processing it. The following code snippet is excerpted from one of the parsers:

var length;
try {
    length = "<field name=\"length\">"
        + document.body.getElementsByClassName("icon-calendar")[0].parentNode.childNodes[1].innerText
            .trim()
            .replace(/&/g, '&amp;')
            .replace(/</g, '&lt;')
            .replace(/>/g, '&gt;')
            .replace(/"/g, '&quot;')
            .replace(/'/g, '&#39;')
        + "</field>\n\t";
} catch (err) {
    length = "<field name=\"length\">Undefined</field>\n\t";
}

The above code snippet shows how parsed data is obtained, processed, and output in the correct XML format. The data is obtained using the command shown in the tables of the "Data Crawling & Parsing" section. COURSES trims unnecessary whitespace and replaces special characters such as "&", "<", ">", quotation marks, and apostrophes with their XML entity equivalents, so that the generated files are valid XML that Solr can read.
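The escaping step can be factored into a small helper. This is a minimal sketch mirroring the replace chain in the snippet above (the function names are illustrative, not COURSES' actual code); note that "&" must be escaped first, or the "&" inside the other entities would be double-escaped:

```javascript
// Escape the five XML special characters so arbitrary parsed text can be
// embedded inside a Solr <field> element. The ampersand replacement must
// run first to avoid re-escaping the entities produced by later steps.
function xmlEscape(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Build one Solr XML field element from an attribute name and value.
function field(name, value) {
  return '<field name="' + name + '">' + xmlEscape(value) + '</field>';
}
```

For example, field("title", 'C & C++ <basics>') produces a well-formed element with the ampersand and angle brackets safely encoded.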