- Data Science Projects with Python
- Stephen Klosterman
- 572字
- 2024-10-30 02:13:00
Different Types of Data Science Problems
Much of your time as a data scientist is likely to be spent wrangling data: figuring out how to get it, getting it, examining it, making sure it's correct and complete, and joining it with other types of data. pandas will facilitate this process for you. However, if you aspire to be a machine learning data scientist, you will need to master the art and science of predictive modeling. This means using a mathematical model, or idealized mathematical formulation, to learn the relationships within the data, in the hope of making accurate and useful predictions when new data comes in.
For this purpose, data is typically organized in a tabular structure, with features and a response variable. For example, if you want to predict the price of a house based on some characteristics about it, such as area and number of bedrooms, these attributes would be considered the features and the price of the house would be the response variable. The response variable is sometimes called the target variable or dependent variable, while the features may also be called the independent variables.
If you have a dataset of 1,000 houses including the values of these features and the prices of the houses, you can say you have 1,000 samples of labeled data, where the labels are the known values of the response variable: the prices of different houses. Most commonly, the tabular data structure is organized so that different rows are different samples, while features and the response occupy different columns, along with other metadata such as sample IDs, as shown in Figure 1.9.
Figure 1.9: Labeled data (the house prices are the known target variable)
Regression Problem
Once you have trained a model to learn the relationship between the features and response using your labeled data, you can then use it to make predictions for houses where you don't know the price, based on the information contained in the features. The goal of predictive modeling in this case is to be able to make a prediction that is close to the true value of the house. Since we are predicting a numerical value on a continuous scale, this is called a regression problem.
Classification Problem
On the other hand, if we were trying to make a qualitative prediction about the house, to answer a yes or no question such as "will this house go on sale within the next five years?" or "will the owner default on the mortgage?", we would be solving what is known as a classification problem. Here, we would hope to answer the yes or no question correctly. The following figure is a schematic illustrating how model training works, and what the outcomes of regression or classification models might be:
Figure 1.10: Schematic of model training and prediction for regression and classification
Classification and regression tasks are called supervised learning, which is a class of problems that relies on labeled data. These problems can be thought of a needing "supervision" by the known values of the target variable. By contrast, there is also unsupervised learning, which relates to more open-ended questions of trying to find some sort of structure in a dataset that does not necessarily have labels. Taking a broader view, any kind of applied math problem, including fields as varied as optimization, statistical inference, and time series modeling, may potentially be considered an appropriate responsibility for a data scientist.