Arvind has been helping students with GATE preparation since 2005. He completed his MS in 2007 from the Indian Institute of Science (Computer Science and Automation). He has worked in software development at Symantec and Nvidia. His areas of expertise are algorithms, operating systems, compilers, and database management.
Posts made by arvind
RE: Query optimization
The difference between a semi-join and a conventional join is that each row of the first table is returned at most once. Even if the second table contains two matches for a row in the first table, only one copy of that row is returned. Semi-joins are written using the EXISTS or IN constructs.
RE: Query optimization
If the subquery returns at least one row, the result of EXISTS is true. If the subquery returns no row, the result of EXISTS is false.
EXISTS is often used with a correlated subquery.
The result of EXISTS depends on whether any row is returned by the subquery, not on the contents of those rows. Therefore, the columns that appear in the SELECT clause of the subquery are not important.
For this reason, the common coding convention is to write EXISTS in the following form:
EXISTS (SELECT *
          FROM table_2
         WHERE table_2.column_2 = table_1.column_1);
Note that even if the subquery returns a row consisting only of NULLs, the result of EXISTS is still true, because EXISTS tests only for the presence of rows, not their contents.
The four phases of (SQL) statement processing in System R are: parsing, optimization, code generation, and execution.
We’re concerned here with the optimization phase, and in particular with access path selection. An access path is a way of accessing the tuples of a relation – there is always the possibility of a full (data) segment scan, but there may also be one or more indices.
Learning program in Data Science
We are happy to introduce mentors for personalized learning in Data Science. This is a three-week program aimed at clarifying concepts encountered when learning data science. Depending on the candidate's background, there will be a personalized learning path and recommendations for projects. Candidates get a chance to interact 1-1 with the mentors. Register for a free assessment of your data science skills here - https://goo.gl/forms/bILW1cOpoaQFlyMw2. The assessment will be conducted on Jan 1-6 and the course starts on Jan 7.
Getting started in data science
Intro to data science
Job search strategies
Machine learning advanced topics
Creating a data science resume
Effective interviewing in data science
Bargava Subramanian is an India-based data scientist at Cisco Systems. Bargava has 14 years' experience delivering business analytics solutions to investment banks, entertainment studios, and high-tech companies. He has given talks and conducted numerous workshops on data science, machine learning, deep learning, and optimisation in Python and R around the world. Bargava holds a master's degree in statistics from the University of Maryland at College Park.
Ashutosh Trivedi is working on his startup and has several years of experience in Machine Learning and Data Science. He did his Masters from IIITB, specializing in ML, and headed data analytics at Grabhouse. Ashutosh in his own words: "My primary interest areas are Machine Learning, Deep Learning, Natural Language Processing, and Distributed Computing. I like sharing knowledge through talks, training workshops, and open-source contributions. Follow my code at https://github.com/codeAshu"
Rashmi Vishwakarma worked as an Assistant Professor for five years. In her own words - "I have published research papers in international and national journals and participated in international and national conferences and seminars. I have attended many Faculty Development Programs. I feel that as a teacher it is my role to motivate students to overcome their inhibitions and shine, encourage them to sharpen their talents, and groom them to face the world with confidence."
DETAILED LIST OF TOPICS
PREDICTIVE MODELING - SEGMENTATION
Introduction to Segmentation
Types of Segmentation
Heuristic Segmentation Techniques
Behavioral Segmentation Techniques (K-Means Cluster Analysis)
Cluster evaluation and profiling
Interpretation of results - Implementation on new data
PREDICTIVE MODELING - DECISION TREES
Decision Trees - Introduction - Applications
Types of Decision Tree Algorithms
CHAID Vs. CART
Decision Trees - Validation
Overfitting - Best Practices to avoid
Implementation of Solution
PREDICTIVE MODELING - LINEAR REGRESSION
Linear Regression - Introduction - Applications
Assumptions of Linear Regression
Building Linear Regression Model
PREDICTIVE MODELING - LOGISTIC REGRESSION
Logistic Regression - Introduction - Applications
Linear Regression Vs. Logistic Regression Vs. Generalized Linear Models
Building Logistic Regression Model
Statistical learning vs. Machine learning
Major Classes of Learning Algorithms - Supervised vs. Unsupervised Learning
Concept of Overfitting and Underfitting
Types of Cross validation
REGRESSION & CLASSIFICATION MODEL BUILDING
Recursive Partitioning (Decision Trees)
Ensemble Models (Random Forest, Bagging & Boosting)
Starting with Hadoop and Spark
Start by setting up a simple Hadoop cluster. Hortonworks has Ambari to ease the setup of the cluster.
The next step is understanding MapReduce. Google's original paper is a good start. Before diving head-first into the horrible Java API, you need to understand what Map and Reduce are and be able to deconstruct a variety of data processing tasks into a Map and a Reduce.
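To make that concrete, here is a minimal sketch using plain Scala collections (no Hadoop involved; the names WordCountSketch, mapPhase, shuffle, and reducePhase are purely illustrative) of how word count decomposes into a map phase, a shuffle, and a reduce phase:

// Conceptual word count decomposed the MapReduce way, using plain Scala
// collections. This only illustrates the shape of the computation:
// map emits (key, value) pairs, the framework groups them by key,
// and reduce folds each group into a result.
object WordCountSketch {
  // Map phase: for each input line, emit (word, 1) pairs.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(line => line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)))

  // Shuffle: group the emitted pairs by key (Hadoop does this for you).
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: fold the values for each key into a single count.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, counts) => (word, counts.sum) }

  def main(args: Array[String]): Unit = {
    val lines = Seq("the quick brown fox", "the lazy dog")
    println(reducePhase(shuffle(mapPhase(lines))))
  }
}

If you can carve a problem into these three steps, you already understand the part of MapReduce that matters; the Hadoop API is mostly boilerplate around it.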
Figure out the core classes in the API. You've got Mapper and Reducer, which you will extend with your code. You've got input formats, which control how the input file is viewed by Hadoop. You've got wrappers around individual values (LongWritable, Text, etc.).
Run a few test examples. Hadoop comes with a bunch of examples - you can compile and run them. Then start playing with the code in them, changing input files, changing the input file format, and eventually changing what the map() and reduce() methods do.
There is a lot of stuff that has been built on top of MapReduce over the past few years. Of interest to you is Spark. Learn a bit of Scala first; it will help. Spark works by creating transformation pipelines for your data, giving you more "atomic" operations than just map and reduce. You basically string these operations together in Scala code. Transformations in Spark are lazy: they are only executed when an action is triggered (an action is something like "save"). Once triggered, Spark internally creates an execution strategy (think RDBMS query optimization) that produces the results fast. Defining transformations in Scala will be much easier than building MapReduce jobs against Hadoop's Java API.
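As a rough illustration of that last point (not from the original post; it assumes Spark 2.x running locally, and SparkPipelineSketch, input.txt, and counts-output are placeholder names), a word-count pipeline in Scala looks like this; nothing executes until the final action:

import org.apache.spark.sql.SparkSession

object SparkPipelineSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; appName and master are illustrative.
    val spark = SparkSession.builder()
      .appName("word-count-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Transformations: these only describe the pipeline. Because Spark is
    // lazy, nothing has been read or computed at this point.
    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Action: only now does Spark build an execution plan and run the job.
    counts.saveAsTextFile("counts-output")

    spark.stop()
  }
}

Compare this with the equivalent Mapper/Reducer boilerplate in the Java API and the appeal of Spark becomes obvious.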