BIG DATA Projects


Year 2017-2018

  1. Hamza Ouhaichi, This project presents an implementation of email spam classification using Spark and Java. The classification is based on the Naive Bayes approach: first we wrote the implementation from scratch using only basic Spark primitives, then we used the “NaiveBayesModel” from MLlib. For the classification we used only the email subjects, because accuracy dropped too low when using the whole body of the email. We made use of Gradle for packaging the Java application, which, once run, is accessible from the browser.
    Get it from GitHub
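The from-scratch Naive Bayes step described above can be sketched in plain Python: a minimal multinomial model with Laplace smoothing trained on made-up email subjects. This is an illustration of the technique only, not the project's Spark/Java code, and the tiny dataset is hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(subjects, labels):
    """Train a multinomial Naive Bayes on tokenized email subjects.
    Returns class priors and Laplace-smoothed word likelihoods."""
    docs = defaultdict(list)
    for text, label in zip(subjects, labels):
        docs[label].extend(text.lower().split())
    vocab = {w for words in docs.values() for w in words}
    priors = {c: labels.count(c) / len(labels) for c in docs}
    likelihoods = {}
    for c, words in docs.items():
        counts = Counter(words)
        total = len(words) + len(vocab)  # Laplace smoothing denominator
        likelihoods[c] = {w: (counts[w] + 1) / total for w in vocab}
    return priors, likelihoods, vocab

def classify(subject, priors, likelihoods, vocab):
    """Return the class with the highest log posterior for a subject line."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in subject.lower().split():
            if w in vocab:
                score += math.log(likelihoods[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical training subjects, two per class
subjects = ["win free money now", "cheap meds free offer",
            "meeting agenda for monday", "project status update"]
labels = ["spam", "spam", "ham", "ham"]
priors, lik, vocab = train_nb(subjects, labels)
print(classify("free money offer", priors, lik, vocab))  # → spam
```

Working in log space avoids floating-point underflow when many word likelihoods are multiplied, which is the standard trick in any Naive Bayes implementation.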
  2. Andrea Medda and Alessio Pili, In this project we present Price Probe, a software suite built to forecast the prices of Amazon items. Price Probe’s primary aim is to predict each item’s future price trend with a customized forecast, obtained by fine-tuning ARIMA’s parameters and selecting the combination of exogenous features that yields the lowest Mean Absolute Percentage Error (MAPE). We use ARIMA since it is well suited to time-series studies and allows adding external features to the model via exogenous variables. With this method and a given set of exogenous variables, we achieved a low global MAPE of 2.43%. Since Amazon’s data are not open, obtaining them is quite expensive; with more resources, it would surely be possible to achieve even better results.
    Get it from GitHub
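The MAPE score that Price Probe minimizes can be illustrated with a toy example in plain Python: a least-squares AR(1) fit standing in for the full ARIMA model, evaluated on a held-out tail of a hypothetical price series. The data and the simplified model are assumptions for illustration, not the project's pipeline.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    return 100 * sum(abs(a - p) / abs(a)
                     for a, p in zip(actual, predicted)) / len(actual)

# Toy AR(1) stand-in for ARIMA: fit x_t ≈ phi * x_{t-1} by least squares,
# then make one-step-ahead forecasts on a held-out tail of the series.
prices = [20.0, 19.5, 19.8, 20.2, 19.9, 20.1, 19.7, 20.0]  # hypothetical
train, test = prices[:6], prices[6:]
num = sum(train[t] * train[t - 1] for t in range(1, len(train)))
den = sum(x * x for x in train[:-1])
phi = num / den
preds = [phi * prev for prev in [train[-1]] + test[:-1]]
print(round(mape(test, preds), 2))  # → 1.77
```

The percentage form of the error is what makes MAPE comparable across items with very different price levels, which is why it suits a per-item forecast evaluation like the one described above.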
  3. Simone Cusimano and Lodovica Marchesi, In this project we present an online system that extracts Facebook pictures of visitors and leverages cognitive computing to classify them into one of four classes: one or more males present, one or more females present, a mixed group of males and females, or no people present at all.
    The system was trained on 300 images, each manually annotated with one of the four categories above. Microsoft cognitive services were employed to extract tags from the images. Tags are text elements used to build the vector space with a bag-of-words or TF-IDF representation.
    Our evaluation on 300 images using 10-fold cross-validation indicates an accuracy of 51% on this multi-class problem, illustrating the potential of cognitive-computing tools. Moreover, our system can extract all the personal images of any visitor logged in to Facebook through the Facebook APIs.
    Get it from GitHub
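The bag-of-words/TF-IDF vector space built from image tags can be sketched in plain Python. The tag lists below are hypothetical, standing in for the Microsoft cognitive services output described above; the weighting uses the standard log(N/df) IDF.

```python
import math

def tfidf(docs):
    """Turn tag lists into TF-IDF vectors over the shared vocabulary.
    docs: list of tag lists, e.g. the output of an image-tagging service."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # idf with the standard log(N / document-frequency) weighting
    idf = {t: math.log(n / sum(t in d for d in docs)) for t in vocab}
    vectors = []
    for d in docs:
        # term frequency (normalized count) times inverse document frequency
        vectors.append([d.count(t) / len(d) * idf[t] for t in vocab])
    return vocab, vectors

# Hypothetical tags for three images
docs = [["person", "man", "outdoor"],
        ["person", "woman", "smile"],
        ["tree", "outdoor", "sky"]]
vocab, vecs = tfidf(docs)
```

Each image becomes a fixed-length vector indexed by the shared vocabulary, which is what allows a standard classifier to be trained and evaluated with cross-validation as described above.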
  4. Ilyas Chaoua, As data collections grow large, it becomes difficult to find what we are looking for, so we need tools and techniques to organize, search and understand vast quantities of information. Topic modelling provides methods to organize, understand and summarize large collections of textual information. It helps in: (i) discovering hidden topical patterns present across the collection; (ii) annotating documents according to these topics; (iii) using these annotations to organize, search and summarize texts.

    One of the best-known topic-modelling algorithms is Latent Dirichlet Allocation (LDA), a widely used technique, alongside the TextRank process, a graph-based algorithm for extracting relevant key phrases. This project explains how to use LDA within Spark 2.2 to summarize the input dataset (newset.csv) and explore the topics and terms that correlate together.
    Get it from GitHub
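To make the LDA idea concrete, here is a minimal collapsed Gibbs sampler in pure Python on a toy corpus. This is an illustration of the technique only, not the Spark 2.2 MLlib implementation the project uses (MLlib's LDA relies on EM or online variational inference); the corpus and hyperparameters are assumptions.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of token lists; k: number of topics.
    Returns per-topic word counts, from which top terms can be read off."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    # random initial topic assignment for every token
    z = [[rng.randrange(k) for _ in d] for d in docs]
    ndk = [[0] * k for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(k)]  # topic-word counts
    nk = [0] * k                                # tokens per topic
    for i, d in enumerate(docs):
        for j, w in enumerate(d):
            t = z[i][j]
            ndk[i][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(iters):
        for i, d in enumerate(docs):
            for j, w in enumerate(d):
                t = z[i][j]  # remove this token's current assignment
                ndk[i][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # full conditional: p(t) ∝ (ndk+alpha) * (nkw+beta) / (nk+V*beta)
                weights = [(ndk[i][t2] + alpha) * (nkw[t2][w] + beta)
                           / (nk[t2] + V * beta) for t2 in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[i][j] = t
                ndk[i][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return nkw

# Toy corpus with two plausible latent topics (hypothetical data)
docs = [["spark", "cluster", "data"], ["spark", "data", "job"],
        ["goal", "match", "team"], ["team", "match", "win"]]
topics = lda_gibbs(docs, k=2)
for t, counts in enumerate(topics):
    print(t, sorted(counts, key=counts.get, reverse=True)[:3])
```

Reading off the highest-count words per topic is the same "explore the topics and terms that correlate together" step the project performs on its dataset, just at toy scale.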

Università degli Studi di Cagliari